Question 1

The \((X'X)^{-1}\) matrix for the model \(y=β_0+β_1 x_1+β_2 x_2+β_3 x_3+β_4 x_4+β_5 x_5+β_6 x_6+ε\) is given below.

  1. If MSE = 1.395 and n = 38, compute the following quantities.

\[se(\mathbf{\hat\beta_4})=\sqrt{MSE\times C_{55}}=\sqrt{1.395\times0.069}=0.3102499\]

\[Cov(\mathbf{\hat\beta_2,\hat\beta_4})=MSE\times C_{35}=1.395\times(-0.035)=-0.048825\]

\[se(\mathbf{\hat\beta_2})=\sqrt{MSE\times C_{33}}=\sqrt{1.395\times0.067}=0.3057205\]

\[Cor(\mathbf{\hat\beta_2,\hat\beta_4})=\frac{Cov(\mathbf{\hat\beta_2,\hat\beta_4})}{se(\mathbf{\hat\beta_2})se(\mathbf{\hat\beta_4})}=\frac{-0.048825}{0.3057205\times0.3102499}=-0.5147615\]
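These hand computations can be verified with a short script (Python here, purely for illustration; the constants are copied from the work above):

```python
# Quick check of the hand calculations, assuming MSE = 1.395 and the quoted
# (X'X)^{-1} entries C55 = 0.069, C33 = 0.067, C35 = -0.035 (1-indexed, so
# C55 corresponds to beta_4 and C33 to beta_2).
import math

MSE = 1.395
C55, C33, C35 = 0.069, 0.067, -0.035

se_b4 = math.sqrt(MSE * C55)           # se(beta4_hat)
se_b2 = math.sqrt(MSE * C33)           # se(beta2_hat)
cov_b2b4 = MSE * C35                   # Cov(beta2_hat, beta4_hat)
cor_b2b4 = cov_b2b4 / (se_b2 * se_b4)  # Cor(beta2_hat, beta4_hat)

print(round(se_b4, 7))     # ≈ 0.3102499
print(round(cov_b2b4, 6))  # ≈ -0.048825
print(round(se_b2, 7))     # ≈ 0.3057205
print(round(cor_b2b4, 7))  # ≈ -0.5147615
```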

\(C_{66}=0.058\) is the smallest diagonal element. Therefore \(\hat\beta_5\) has the least variance and is the most precisely estimated of the coefficients.

According to the \((X'X)^{-1}\), the entries \(C_{13},\ C_{17},\ C_{24},\ C_{25},\ C_{67}\) are positive.

Therefore, the positively correlated pairs of parameters are

\[\hatβ_0\ \&\ \hatβ_2,\quad \hatβ_0\ \&\ \hatβ_6,\quad \hatβ_1\ \&\ \hatβ_3,\quad \hatβ_1\ \&\ \hatβ_4,\quad \hatβ_5\ \&\ \hatβ_6\]

  1. Consider the following hypothesis: \(H_0: β_1=2β_3,β_2=β_3,β_5=0\)

\[ \mathbf{T}=\begin{bmatrix} 0 & 1 & 0 & -2 & 0 & 0& 0 \\ 0 & 0 & 1 & -1 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 1 & 0 \end{bmatrix}_{3\times7} \mathbf{β}=\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \\ \beta_5 \\ \beta_6 \end{bmatrix}_{7\times1} \mathbf{C}=\begin{bmatrix} 0 \\ 0 \\ 0\end{bmatrix}_{3\times1} rank(T)=3 \]

In this hypothesis, \(y=β_0+2β_3x_1+β_3x_2+β_3x_3+β_4x_4+0x_5+β_6x_6+ε=β_0+β_3(2x_1+x_2+x_3)+β_4x_4+β_6x_6+ε\)

The full model has 6 predictor parameters, while the reduced model has 3.

\[F_0=\frac{\frac{SSE_{Reduced}-SSE_{Full}}{dfE_{Reduced}-dfE_{Full}}}{\frac{SSE_{Full}}{dfE_{Full}}}=\frac{\frac{SSE_{Reduced}-SSE_{Full}}{[n-(k+1-r)]-[n-(k+1)]}}{\frac{SSE_{Full}}{n-(k+1)}}=\frac{\frac{SSE_{Reduced}-SSE_{Full}}{[38-(6+1-3)]-[38-(6+1)]}}{\frac{SSE_{Full}}{38-(6+1)}}=\frac{\frac{SSE_{Reduced}-SSE_{Full}}{3}}{\frac{SSE_{Full}}{31}}\]

In this conceptual form, the numerator degrees of freedom are \(\nu_1=3\) and the denominator degrees of freedom are \(\nu_2=31\).

Rearranged into closed form, \(F_0=\frac{31\,(SSE_{Reduced}-SSE_{Full})}{3\,SSE_{Full}}\): the constant 31 moves to the numerator and 3 to the denominator.
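The arithmetic can be sketched as follows; the SSE values are hypothetical, since the question does not supply them (Python used only for illustration):

```python
# The partial F statistic in closed form; the SSE values below are
# hypothetical, chosen only to illustrate the computation.
n, k, r = 38, 6, 3            # n observations, k slopes, r = rank(T)
df_full = n - (k + 1)         # error df of the full model: 31
df_reduced = n - (k + 1 - r)  # error df of the reduced model: 34

sse_reduced, sse_full = 60.0, 43.25   # hypothetical values
F0 = ((sse_reduced - sse_full) / (df_reduced - df_full)) / (sse_full / df_full)
print(df_reduced - df_full, df_full)  # numerator df 3, denominator df 31
print(round(F0, 4))
```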

\[SSR=\sum_{i=1}^n(\hat y_i-\bar y)^2=\sum_{i=1}^n(\hat y_i^2-2\hat y_i\bar y+\bar y^2)=\sum_{i=1}^n\hat y_i^2-2\bar y\sum_{i=1}^n\hat y_i+\sum_{i=1}^n\bar y^2\]

\[=\sum_{i=1}^n\hat y_i^2-2\bar yn\frac{\sum_{i=1}^n\hat y_i}n+n\bar y^2=\sum_{i=1}^n\hat y_i^2-2\bar yn\bar y+n\bar y^2=\sum_{i=1}^n\hat y_i^2-n\bar y^2\]
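The identity derived above can be checked numerically on any OLS fit that includes an intercept (the step \(\sum\hat y_i/n=\bar y\) requires the intercept); the data below are made up:

```python
# Numerical check of SSR = sum(yhat_i^2) - n*ybar^2 for an OLS fit with an
# intercept. The x, y values are arbitrary illustration data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
X = np.column_stack([np.ones_like(x), x])        # design with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta
ybar, n = y.mean(), len(y)

ssr_def = np.sum((yhat - ybar) ** 2)             # SSR by definition
ssr_short = np.sum(yhat ** 2) - n * ybar ** 2    # shortcut derived above
print(np.isclose(ssr_def, ssr_short))            # True
```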

Question 2

  1. The matrix of scatterplots and correlation


  1. Which predictors are significantly related to (i.e., most likely to contribute to the variation in) the response variable?

Based on the scatterplots and correlations, \(Cor(y,x_4)=0.866\), \(Cor(y,x_1)=0.781\), \(Cor(y,x_7)=0.668\), and \(Cor(y,x_2)=0.666\) indicate medium-to-strong positive linear relationships with the response variable (correlation coefficient above 0.6), while \(Cor(y,x_5)=-0.62\) indicates a medium negative linear relationship.


  1. Fit the full model.

\[\hat y=292.561-203.144X_1+ 1055.782X_2-49.24X_3+209.762X_4-10.197X_5-24.558X_6+142.778X_7+511.713X_8-301.872X_9\]


  1. Significant

The overall fitted model is statistically significant at the 5% significance level (p-value = \(9.744\times10^{-6}\)).

But most of the individual coefficients are not significant, so this is unlikely to be the best-fitting model.


  1. The violation of random errors
  • Residual diagnostics: there are some violations; the model does not satisfy the OLS assumptions about the random errors.

On the residual plot, there is a funnel pattern.

On the outlier and leverage plot, there are two outliers.

On the QQ plot, most points follow an approximately straight line, but there is some positive skew.

  • Suggestion: transform the response and run further diagnostics.

I suggest taking the natural log of the response as a variance-stabilizing transformation.

Further diagnostics for heteroskedasticity, variable selection, and measures of influence should also be considered.


  1. The partial sum of squares explained by rainfall.

According to the partial F test, the partial sum of squares explained by rainfall is 2209825, given that all the other regressors are already in the model.


  1. Multicollinearity

According to the VIF (variance inflation factor) diagnostics, the model does have a serious multicollinearity problem: the VIFs of X4 (105.754708), X1 (101.859709), X3 (31.446394), and X7 (20.53505) are all larger than 10.
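For reference, the VIF reported by `ols_vif_tol` can be reproduced from first principles. This Python sketch uses synthetic data (not the WaterFlow data), with one column built as a near-copy of another to force collinearity:

```python
# A first-principles VIF computation: VIF_j = 1 / (1 - R_j^2), where R_j^2
# is the R-squared from regressing predictor j on the other predictors.
# Synthetic data: x2 is nearly collinear with x1, x3 is independent.
import numpy as np

def vif(X: np.ndarray, j: int) -> float:
    """Variance inflation factor of column j of predictor matrix X."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])      # intercept + others
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)  # regress x_j on rest
    resid = X[:, j] - A @ coef
    r2 = 1.0 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 0.05 * rng.normal(size=50)  # nearly collinear with x1
x3 = rng.normal(size=50)              # unrelated to x1 and x2
X = np.column_stack([x1, x2, x3])

print(vif(X, 0) > 10, vif(X, 2) < 2)  # True True
```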


  1. Interpret the estimated coefficient of rainfall predictor

The coefficient of 511.713 in the full model suggests that the average peak rate of flow increases by 511.713 cubic feet per second when rainfall increases by 1 inch, holding all other variables constant.


  1. Log of response

\[\hat{\ln(y)}=3.402256-0.013532X_1-1.023664X_2+0.177966X_3+0.108788X_4\] \[-0.009622X_5-0.389474X_6+4.233475X_7+0.63007X_8-0.462276X_9\]

  1. Significant

The overall fitted model is statistically significant at 5% significance level (p-value=\(7.513\times10^{-11}\)).

But most of the individual coefficients are not significant, so this is unlikely to be the best-fitting model.


  1. Multicollinearity.

The model still has a serious multicollinearity problem. The variance-stabilizing transformation of the response does not change the VIF values, since VIF depends only on the predictors: X4 (105.754708), X1 (101.859709), X3 (31.446394), X7 (20.53505).


  1. If you wanted to simplify this full model, explain which predictor you would eliminate first.
  • VIF

Considering the VIF alone, X4 (105.754708) or X1 (101.859709), the variables with the largest VIF values, would be the first to remove.

  • Correlation with ln(y)

However, according to the correlation coefficients, X4, X1, and X7 are strongly correlated with ln(y) (\(Cor(\ln(y),x_4)=0.896\), \(Cor(\ln(y),x_1)=0.726\), \(Cor(\ln(y),x_7)=0.592\)). The textbook suggests that general approaches for dealing with multicollinearity include collecting additional data, model respecification (redefining the regressors, variable elimination), and alternative estimation methods (ridge regression, principal-component regression). “Variable elimination is often a highly effective technique. However, it may not provide a satisfactory solution if the regressors dropped from the model have significant explanatory power relative to the response y. That is, eliminating regressors to reduce multicollinearity may damage the predictive power of the model.” (Montgomery et al., 2012, p. 304) From this perspective, the third multicollinear variable, X3 (VIF = 31.446394), which has only a weak relationship with ln(y) (0.476), should be considered instead.

  • Correlation with each other

The variable names show that X1, X3, and X4 are geographic variables: X1 is the area of the watershed, X4 is the longest stream flow in the watershed, and X3 is the average slope of the watershed. For the given 6 watersheds, X1 and X4 are strongly related, so a high correlation (0.921) between them is expected. But X3 is not significantly related to X1 (−0.078) or X4 (0.245), so removing X3 might lose irreplaceable information. If we stopped here, the first variable to eliminate would be X1, since it has a weaker correlation with the response than X4.

  • Elimination regression

Actually, I do not agree with removing any predictor at this stage. Removing a suitable single predictor can already drive the VIFs down substantially: before elimination, X4 contributes most of the inflation, and after elimination regression the multicollinearity disappears in all the models. We should run more diagnostics and comparisons, and gather sufficient evidence, before removing any predictor.

| remove | Max.VIF | R-squared | Eliminated model | New Max.VIF | New R-squared |
|---|---|---|---|---|---|
| none | 106 (X4) | 0.947 | X3, X4, X6, X7, X8, X9 | X7 = 6.28 | 0.947 |
| X1 | 8.70 (X4) | 0.947 | X3, X4, X6, X7, X8, X9 | X7 = 6.28 | 0.947 |
| X2 | 67.4 (X4) | 0.947 | X3, X4, X6, X7, X8, X9 | X7 = 6.28 | 0.947 |
| X3 | 8.87 (X4) | 0.941 | X1, X2, X4, X5, X8, X9 | X1 = 8.39 | 0.937 |
| X4 | 8.38 (X1) | 0.945 | X1, X3, X6, X7, X8, X9 | X9 = 5.10 | 0.943 |
| X5 | 58.4 (X4) | 0.947 | X3, X4, X6, X7, X8, X9 | X7 = 6.28 | 0.947 |
| X6 | 54.8 (X4) | 0.939 | X1, X2, X4, X5, X8, X9 | X1 = 8.39 | 0.937 |
| X7 | 42.0 (X1) | 0.939 | X1, X2, X4, X5, X8, X9 | X1 = 8.39 | 0.937 |
| X8 | 103 (X4) | 0.900 | X1, X3, X6, X7 | X7 = 3.96 | 0.893 |
| X9 | 100 (X4) | 0.910 | X1, X3, X6, X7, X8 | X7 = 3.97 | 0.906 |

From the single-removal regressions, removing X8 or X9 hurts the R-squared most, although they have low correlation with ln(y); removing X1, X2, or X5 affects it least. After elimination we obtain three kinds of model: ‘346789’ (\(R^2=0.947\)); ‘124589’ (\(R^2=0.937\)); ‘1367(8)(9)’ (\(R^2=0.943\)–\(0.906\)). X8 appears in all the eliminated models unless removed manually, X1 appears in 5 of them, and X2 and X5 appear in only 3.

Turning back to the correlation coefficients, X5 has a stronger correlation with ln(y) than X2 (\(Cor(\ln(y),X_2)=0.658\), \(Cor(\ln(y),X_5)=-0.723\)). Considering the context, X1 (area of watershed) is a terrain variable, while X5 (surface absorbency index) and X2 (area impervious to water) are surface-property variables with similar effects on initial flow (see the Discussion at the end). Based on this context, X1 is more of an independent variable than X2 or X5.

Finally, among the four surface-property variables related to initial flow (X2, X5, X6, X7), X2 and X7 have the highest correlation, which means X7 may carry most of the information contained in X2. Therefore, if we have to remove one variable, X2 is the best option. I will show that removing X1 is not a good choice in the Discussion section.

| | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | ln(y) |
|---|---|---|---|---|---|---|---|---|---|---|
| X2 | 0.801 | 1.000 | -0.073 | 0.761 | -0.486 | 0.066 | 0.832 | 0.105 | 0.133 | 0.658 |
| X5 | -0.737 | -0.486 | -0.403 | -0.777 | 1.000 | -0.270 | -0.481 | -0.040 | -0.105 | -0.723 |

  1. The forward selection method (use α=0.15)

Use Stepwise Forward Regression based on p values (use α=0.15)

\[\hat{\ln(y)}=2.872+0.168X_3+0.122X_4+3.106X_7\]

Use Stepwise AIC Forward Regression

\[\hat{\ln(y)}=2.692+0.184X_3+0.109X_4-0.368X_6+4.085X_7+0.612X_8-0.448X_9\]


  1. The backward elimination method (use α=0.05)

Stepwise Backward Regression based on p-values (α = 0.05) and Stepwise AIC Backward Regression give the same result.

\[\hat{\ln(y)}=2.692+0.184X_3+0.109X_4-0.368X_6+4.085X_7+0.612X_8-0.448X_9\]


  1. Best subsets method

The best subsets method yields the same model.

\[\hat{\ln(y)}=2.692+0.184X_3+0.109X_4-0.368X_6+4.085X_7+0.612X_8-0.448X_9\]


  1. Compare and suggest one best model
| Method | By | Keep | Remove |
|---|---|---|---|
| Stepwise Forward | p = 0.15 | X3, X4, X7 | X1, X2, X5, X6, X8, X9 |
| Stepwise Forward | AIC | X3, X4, X6, X7, X8, X9 | X1, X2, X5 |
| Stepwise Backward | p = 0.05 | X3, X4, X6, X7, X8, X9 | X1, X2, X5 |
| Stepwise Backward | AIC | X3, X4, X6, X7, X8, X9 | X1, X2, X5 |
| Stepwise Both | p | X3, X4, X7 | X1, X2, X5, X6, X8, X9 |
| Stepwise Both | AIC | X3, X4, X6, X7, X8, X9 | X1, X2, X5 |
| Best Subset | p | X3, X4, X6, X7, X8, X9 | X1, X2, X5 |
| Best Subset | AIC | X3, X4, X6, X7, X8, X9 | X1, X2, X5 |
| All possible | / | X3, X4, X6, X7, X8, X9 | X1, X2, X5 |
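For reference, the AIC that drives R's stepwise selection is, for a Gaussian linear model, equivalent up to an additive constant to \(n\ln(SSE/n)+2p\). A minimal sketch with hypothetical SSE values (not the fitted values from this report):

```python
# For a Gaussian linear model, stepwise AIC is equivalent (up to an
# additive constant) to n*ln(SSE/n) + 2p, with p estimated coefficients.
# The SSE values below are hypothetical; smaller AIC is preferred.
import math

def lm_aic(sse: float, n: int, n_params: int) -> float:
    """Gaussian linear-model AIC, up to an additive constant."""
    return n * math.log(sse / n) + 2 * n_params

n = 30
aic_small = lm_aic(sse=5.2, n=n, n_params=4)  # 3 predictors + intercept
aic_big = lm_aic(sse=3.8, n=n, n_params=7)    # 6 predictors + intercept
print("6-predictor" if aic_big < aic_small else "3-predictor")
```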

Both models solve the multicollinearity problem (all VIF < 10) and have small p-values for the overall F test. Neither shows serious violations of the assumptions about the errors: there is no significant pattern in the plot of studentized residuals versus predicted values for either model, the partial-regression plots show no nonlinear patterns, and the points follow an approximately straight line on the QQ plot. The correlations between observed residuals and expected residuals under normality are also high: 0.9837263 for the 3-predictor model and 0.9856766 for the 6-predictor model.

| Model | VIF | F | p-value (F) | MSR | MSE | \(R_{adjusted}^2\) | \(R_{Predict}^2\) | p-value (t) | Residual plots |
|---|---|---|---|---|---|---|---|---|---|
| 3-4-7 | < 10 | 70.378 | \(1.312\times10^{-12}\) | 21.188 | 0.301 | 0.878 | 0.854 | max = 0.054 | good enough |
| 3-4-6-7-8-9 | < 10 | 68.16 | \(1.717\times10^{-13}\) | 11.265 | 0.165 | 0.933 | 0.908 | max = 0.019 | good enough |

However, compared with the 3-predictor model, the 6-predictor model has a higher adjusted R-squared (by about 6%) and a higher prediction R-squared (by about 5%), which indicates stronger predictive capability. All the coefficients in the 6-predictor model are statistically significant at better than the 2% level (the maximum p-value is 0.019). In the 3-predictor model, X7 has a p-value of 0.054, so it is not significant at the 5% level. If we vary the \(\alpha\) threshold of forward selection, the same 3-predictor model is produced for \(\alpha\) roughly between 0.06 and 0.17. Furthermore, considering the context, X8 and X9 are precipitation variables; the 3-predictor model implies that peak flow is unrelated to precipitation, which does not make sense. The earlier test in question (3) also found that X8 and X9 are important. Therefore, the best model is the 3-4-6-7-8-9 model with 6 predictors.


  1. Provide complete ANOVA table for the best model. Provide partial sum of squares, estimated coefficients, standard errors, p-values, 95% Bonferroni joint confidence intervals for the coefficients of the best model. Provide in a tabular form clearly.
Model Summary

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| R | 0.973 | RMSE (Root Mean Square Error) | 0.407 |
| R-Squared | 0.947 | Coef. Var | 6.385 |
| Adj. R-Squared | 0.933 | MSE (Mean Square Error) | 0.165 |
| Pred R-Squared | 0.908 | MAE (Mean Absolute Error) | 0.273 |

ANOVA

| Source | Sum of Squares | DF | Mean Square | F | p-value |
|---|---|---|---|---|---|
| Regression | 67.591 | 6 | 11.265 | 68.16 | \(1.717\times10^{-13}\) |
| Residual | 3.801 | 23 | 0.165 | | |
| Total | 71.393 | 29 | | | |

Parameter Estimates (the 0.357 % and 99.643 % columns are the 95% Bonferroni joint confidence limits)

| Term | Estimate | Partial SS | Std. Error | t | p-value | 0.357 % | 99.643 % |
|---|---|---|---|---|---|---|---|
| (Intercept) | 2.69180 | / | 0.445 | 6.046 | \(3.63\times10^{-6}\) | 1.37732 | 4.00627 |
| X3 | 0.18384 | 5.37 | 0.032 | 5.698 | \(8.41\times10^{-6}\) | 0.08858 | 0.27911 |
| X4 | 0.10905 | 2.98 | 0.026 | 4.244 | 0.000306 | 0.03319 | 0.18491 |
| X6 | -0.36752 | 1.05 | 0.146 | -2.526 | 0.018898 | -0.79716 | 0.06213 |
| X7 | 4.08497 | 1.87 | 1.213 | 3.367 | 0.002662 | 0.50313 | 7.66681 |
| X8 | 0.61161 | 3.52 | 0.133 | 4.614 | 0.000122 | 0.22022 | 1.00299 |
| X9 | -0.44764 | 2.83 | 0.108 | -4.135 | 0.000402 | -0.76728 | -0.12800 |

Based on the joint confidence intervals, X6 may not be significantly different from zero. If we are not limited to the models from questions (4)–(6), I try to find a better model in the Discussion section.


  1. How much variation in the response is explained by the best model after taking number of data and regression coefficients in to account?

With SSR = 67.591 and SSE = 3.801, the adjusted R-squared is 0.9329, so about 93.29% of the variation in the response is explained by the best model after accounting for the number of observations and regression coefficients.
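This figure can be reproduced directly from the ANOVA table entries:

```python
# Reproducing the adjusted R-squared from the ANOVA table of the best
# model: SSE = 3.801, SST = 71.393, n = 30 observations, p = 6 predictors.
sse, sst = 3.801, 71.393
n, p = 30, 6

r2 = 1 - sse / sst                                  # plain R-squared
r2_adj = 1 - (sse / (n - p - 1)) / (sst / (n - 1))  # penalized for p and n
print(round(r2, 4), round(r2_adj, 4))               # 0.9468 0.9329
```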


  1. Report the PRESS statistic of the best model.

The value of PRESS is 6.538275. This model explains 90.8% of variation in predicting the peak rate of flow (in cfs) of water from six watersheds following storm episodes.
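For reference, PRESS (and hence the prediction R-squared) can be computed from the leverages without refitting the model once per deleted observation. This sketch uses a small made-up data set, not the WaterFlow data:

```python
# PRESS without refitting: PRESS = sum_i (e_i / (1 - h_ii))^2, where h_ii
# are the hat-matrix diagonals; prediction R^2 = 1 - PRESS/SST.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 6.2])
X = np.column_stack([np.ones_like(x), x])  # intercept + slope

H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
e = y - H @ y                              # ordinary residuals
press = np.sum((e / (1 - np.diag(H))) ** 2)

sst = np.sum((y - y.mean()) ** 2)
r2_pred = 1 - press / sst                  # prediction R-squared
print(press >= np.sum(e ** 2))             # PRESS never beats SSE: True
```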


Discussion

  • Independence and grouping

Linear regression is one of the Watershed Hydrological Model. Singh (1972) used linear models with a logarithm transformation of the variables.

\[\ln(Dependent\ Variable)=β_0+β_1\ln P+β_2\ln Q+β_3\ln F_P+β_4CC+Interactions+\varepsilon\]

where the dependent variables can either be total storm flow volume (\(Q_t\)) in mm, quick flow volume (\(Q_f\)) in mm or peak flow (\(Q_{pk}\)) in \(m^3 sec^{−1} km^{−2}\). Independent variables were storm rainfall (\(P\)) in \(mm\), initial flow (\(Q_i\)) in \(mm h^{−1}\), rainfall frequency (\(F_p\)), the inverse of rainfall duration, in \(h^{−1}\) and a dummy variable (\(CC\)) representing the treatment effect on basin. \(CC\) was 0 and 1 for the calibration (1967–1992) and treated (1994–1998) periods, respectively. \(β_0\) to \(β_4\) are regression coefficients of the independent variables. All interactions between the independent variables were also tested for significance at \(α=0.10\). (Guillemette et al., 2005)

Inspired by this theory, I divided the variables into 4 groups:

| Group | Variables |
|---|---|
| Precipitation (P) | x8: Rainfall (inches) |
| Time (F) | x9: Time period during which rainfall exceeded ¼ inch/hour |
| Terrain (Q) | x1: Area of watershed (\(mi^2\)); x3: Average slope of watershed (percent); x4: Longest stream flow in watershed (1000s of feet) |
| Surface (Q) | x2: Area impervious to water (\(mi^2\)); x5: Surface absorbency index; x6: Estimated soil storage capacity (inches of water); x7: Infiltration rate of water into soil (inches/hour) |

It is reasonable that the best model should contain at least one variable from each group, so X8 and X9 are indispensable. We have also seen that the stepwise forward method can ignore this context, and it is therefore not recommended for this question. When simplifying a model, we should keep every category non-empty and evaluate each variable against the others inside its group.

  • Log and interaction
| Transform | Interactions | Elimination method | Model | # predictors | # variables | R-squared |
|---|---|---|---|---|---|---|
| ln(all) | no | stepwise both, p | ln(X4) | 1 | 1 | 0.910 |
| ln(y) | no | backward AIC | X3, X4, X6, X7, X8, X9 | 6 | 6 | 0.947 |
| ln(all) | no | stepwise both, AIC | ln(X1), ln(X3), ln(X6), ln(X8), ln(X9) | 5 | 5 | 0.984 |
| mixed | yes | backward AIC | X1:X3, X1:X4, ln(X8), ln(X9) | 7 | 5 | 0.984 |
| ln(all) | no | backward AIC | ln(X1), ln(X3), ln(X5), ln(X6), ln(X8), ln(X9) | 6 | 6 | 0.986 |
| ln(all) | no | backward AIC | ln(X1), ln(X3), ln(X4), ln(X6), ln(X8), ln(X9) | 6 | 6 | 0.986 |
| mixed | yes | backward AIC | X1, X3, X4, X1:X3, X1:X4, ln(X7), ln(X8), ln(X9) | 8 | 6 | 0.987 |
| ln(y) | yes | backward AIC | omitted | 15 | 9 | 0.994 |
| ln(all) | yes | backward AIC | omitted | 22 | 9 | 0.998 |

It is obvious that the all-log model with interactions is overfitting, so we should control the number of predictors. Since elimination regression produced models with 6 predictors, a new model with a higher R-squared and fewer than 7 predictors would improve on the solution in question (7). After comparing coefficients and plots, the best model is the 1-3-6-8-9 all-log model.

\[\hat{\ln(y)}=5.87592+0.68118\ln(X_1)+0.37527\ln(X_3)-0.33576\ln(X_6)+1.72493\ln(X_8)-1.45653\ln(X_9)\]

Model Summary

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| R | 0.992 | RMSE (Root Mean Square Error) | 0.215 |
| R-Squared | 0.984 | Coef. Var | 3.374 |
| Adj. R-Squared | 0.981 | MSE (Mean Square Error) | 0.046 |
| Pred R-Squared | 0.974 | MAE (Mean Absolute Error) | 0.141 |

ANOVA

| Source | Sum of Squares | DF | Mean Square | F | p-value |
|---|---|---|---|---|---|
| Regression | 70.285 | 5 | 14.057 | 304.557 | \(2.2\times10^{-16}\) |
| Residual | 1.108 | 24 | 0.046 | | |
| Total | 71.393 | 29 | | | |

Parameter Estimates (the 0.417 % and 99.583 % columns are the 95% Bonferroni joint confidence limits)

| Term | Estimate | Partial SS | Std. Error | t | p-value | 0.417 % | 99.583 % |
|---|---|---|---|---|---|---|---|
| (Intercept) | 5.87592 | / | 0.22021 | 26.684 | \(2\times10^{-16}\) | 5.24280 | 6.50903 |
| log(X1) | 0.68118 | 39 | 0.02343 | 29.067 | \(2\times10^{-16}\) | 0.61380 | 0.74855 |
| log(X3) | 0.37527 | 0.722 | 0.09489 | 3.955 | 0.00059 | 0.10245 | 0.64809 |
| log(X6) | -0.33576 | 0.783 | 0.08150 | -4.120 | 0.000389 | -0.57008 | -0.10144 |
| log(X8) | 1.72499 | 4.88 | 0.16768 | 10.288 | \(2.82\times10^{-10}\) | 1.24290 | 2.20707 |
| log(X9) | -1.45653 | 4.37 | 0.14975 | -9.727 | \(8.42\times10^{-10}\) | -1.88706 | -1.02599 |

If we had removed X1 in question (3), we could not have reached this model.

I tried other combinations mixing logs and interactions, and the results are interesting. For example, the single-variable model (log(X4)) has a higher R-squared than the 3-4-7 model, and the mixed combination X1:X3, X1:X4, ln(X8), ln(X9) with 5 variables has a higher R-squared than the 3-4-6-7-8-9 model. Due to time constraints, I did not try all the possible combinations. A better answer might also be found using a GLM.

  • Reference

[1]: Montgomery, D. C., Peck, E. A., & Vining, G. G. (2012). Introduction to linear regression analysis (Vol. 821). John Wiley & Sons.

[2]: Guillemette, F., Plamondon, A. P., Prévost, M., & Lévesque, D. (2005). Rainfall generated stormflow response to clearcutting a boreal forest: peak flow comparison with 50 world-wide basin studies. Journal of hydrology, 302(1-4), 137-153.

Code and output

(a) The matrix of scatterplots and the correlation matrix

library(tidyverse)
library(GGally)
library(olsrr)
library(car)
library(MASS)
library(huxtable)
table_wf <- read_table2("WaterFlow.txt")
ggpairs(data=table_wf[c(1:10)])

(c) The fitted full model

# build the model
model_wf_full <- lm(y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9, data=table_wf)
model_wf_full%>% summary()
## 
## Call:
## lm(formula = y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9, 
##     data = table_wf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1404.21  -318.77    74.73   266.66  1274.30 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   292.56    4428.62   0.066   0.9480  
## X1           -203.14     410.27  -0.495   0.6259  
## X2           1055.78    9833.70   0.107   0.9156  
## X3            -49.24     156.20  -0.315   0.7558  
## X4            209.76     162.05   1.294   0.2103  
## X5            -10.20      51.09  -0.200   0.8438  
## X6            -24.56     303.53  -0.081   0.9363  
## X7            142.78    3288.44   0.043   0.9658  
## X8            511.71     209.74   2.440   0.0241 *
## X9           -301.87     172.00  -1.755   0.0945 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 609.3 on 20 degrees of freedom
## Multiple R-squared:  0.8214, Adjusted R-squared:  0.741 
## F-statistic: 10.22 on 9 and 20 DF,  p-value: 9.744e-06
Anova(model_wf_full)
            Sum Sq Df F value Pr(>F)
X1         9.1e+04  1 0.245   0.626 
X2        4.28e+03  1 0.0115  0.916 
X3        3.69e+04  1 0.0994  0.756 
X4        6.22e+05  1 1.68    0.21  
X5        1.48e+04  1 0.0398  0.844 
X6        2.43e+03  1 0.00655 0.936 
X7             700  1 0.00189 0.966 
X8        2.21e+06  1 5.95    0.0241
X9        1.14e+06  1 3.08    0.0945
Residuals 7.43e+06 20

(c) ii Residual diagnostics

#Model Fit Assessment
ols_plot_diagnostics(model_wf_full)

# Part & Partial Correlations
ols_test_correlation(model_wf_full) # Correlation between observed residuals and expected residuals under normality.
## [1] 0.9710713
# Residual Normality Test
ols_test_normality(model_wf_full) # Tests for violation of the normality assumption: larger p-values give no evidence of non-normality
## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.9589         0.2898 
## Kolmogorov-Smirnov        0.1423         0.5314 
## Cramer-von Mises          2.5333         0.0000 
## Anderson-Darling          0.5169         0.1748 
## -----------------------------------------------

(c) iii The partial regression and nonlinear diagnostics

#Lack of Fit F Test
# ols_pure_error_anova(lm(y~X8, data = table_wf))

# Variable Contributions
ols_plot_added_variable(model_wf_full)

# Residual Plus Component Plot
ols_plot_comp_plus_resid(model_wf_full)

(c) iv Collinearity diagnostics

# for full model
ols_vif_tol(model_wf_full)
Variables Tolerance VIF
X1 0.00982 102   
X2 0.133   7.52
X3 0.0318  31.4 
X4 0.00946 106   
X5 0.103   9.68
X6 0.433   2.31
X7 0.0487  20.5 
X8 0.182   5.5 
X9 0.174   5.75

(d) The fitted log y model

# build full log model

 table_wf_logy <- table_wf %>% mutate(logy=log(y))
 table_wf_logy$y <- NULL
 ggpairs(data=table_wf_logy[c(1:10)])

model_wf_full_log <- lm(log(y) ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9, data=table_wf)
summary(model_wf_full_log)
## 
## Call:
## lm(formula = log(y) ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + 
##     X9, data = table_wf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.95298 -0.20764  0.01499  0.18100  0.67539 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.402256   3.150312   1.080 0.293006    
## X1          -0.013532   0.291845  -0.046 0.963477    
## X2          -1.023664   6.995235  -0.146 0.885120    
## X3           0.177966   0.111113   1.602 0.124908    
## X4           0.108788   0.115272   0.944 0.356560    
## X5          -0.009622   0.036341  -0.265 0.793898    
## X6          -0.389474   0.215916  -1.804 0.086345 .  
## X7           4.233475   2.339245   1.810 0.085387 .  
## X8           0.630070   0.149200   4.223 0.000418 ***
## X9          -0.462276   0.122350  -3.778 0.001181 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4334 on 20 degrees of freedom
## Multiple R-squared:  0.9474, Adjusted R-squared:  0.9237 
## F-statistic:    40 on 9 and 20 DF,  p-value: 7.513e-11
Anova(model_wf_full)
            Sum Sq Df F value Pr(>F)
X1         9.1e+04  1 0.245   0.626 
X2        4.28e+03  1 0.0115  0.916 
X3        3.69e+04  1 0.0994  0.756 
X4        6.22e+05  1 1.68    0.21  
X5        1.48e+04  1 0.0398  0.844 
X6        2.43e+03  1 0.00655 0.936 
X7             700  1 0.00189 0.966 
X8        2.21e+06  1 5.95    0.0241
X9        1.14e+06  1 3.08    0.0945
Residuals 7.43e+06 20
#Model Fit Assessment
ols_plot_diagnostics(model_wf_full_log)

# Part & Partial Correlations
ols_test_correlation(model_wf_full_log) # Correlation between observed residuals and expected residuals under normality.
## [1] 0.9808603
# Residual Normality Test
ols_test_normality(model_wf_full_log) # Tests for violation of the normality assumption: larger p-values give no evidence of non-normality
## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.9659         0.4341 
## Kolmogorov-Smirnov        0.0993         0.9007 
## Cramer-von Mises          4.948          0.0000 
## Anderson-Darling          0.3686         0.4062 
## -----------------------------------------------

(d) (2) Collinearity diagnostics

# remove 1-9
model_wf_rm1_log <- lm(log(y) ~ X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9, data=table_wf)
model_wf_rm2_log <- lm(log(y) ~ X1 + X3 + X4 + X5 + X6 + X7 + X8 + X9, data=table_wf)
model_wf_rm3_log <- lm(log(y) ~ X1 + X2 + X4 + X5 + X6 + X7 + X8 + X9, data=table_wf)
model_wf_rm4_log <- lm(log(y) ~ X1 + X2 + X3 + X5 + X6 + X7 + X8 + X9, data=table_wf)
model_wf_rm5_log <- lm(log(y) ~ X1 + X2 + X3 + X4 + X6 + X7 + X8 + X9, data=table_wf)
model_wf_rm6_log <- lm(log(y) ~ X1 + X2 + X3 + X4 + X5 + X7 + X8 + X9, data=table_wf)
model_wf_rm7_log <- lm(log(y) ~ X1 + X2 + X3 + X4 + X5 + X6 + X8 + X9, data=table_wf)
model_wf_rm8_log <- lm(log(y) ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X9, data=table_wf)
model_wf_rm9_log <- lm(log(y) ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data=table_wf)
# Compare vif
ols_vif_tol(model_wf_full_log)
Variables Tolerance VIF
X1 0.00982 102   
X2 0.133   7.52
X3 0.0318  31.4 
X4 0.00946 106   
X5 0.103   9.68
X6 0.433   2.31
X7 0.0487  20.5 
X8 0.182   5.5 
X9 0.174   5.75
# model_wf_aic_log (and the model_wf_rm*_aic_log fits below) are the
# stepwise-AIC-reduced models; their selection code is omitted here
ols_vif_tol(model_wf_aic_log)
Variables Tolerance VIF
X3 0.332 3.01
X4 0.167 5.97
X6 0.839 1.19
X7 0.159 6.28
X8 0.202 4.94
X9 0.195 5.12
ols_vif_tol(model_wf_rm1_log)
Variables Tolerance VIF
X2 0.245 4.08
X3 0.318 3.14
X4 0.115 8.7 
X5 0.283 3.54
X6 0.717 1.39
X7 0.118 8.46
X8 0.19  5.27
X9 0.185 5.41
ols_vif_tol(model_wf_rm1_aic_log)
Variables Tolerance VIF
X3 0.332 3.01
X4 0.167 5.97
X6 0.839 1.19
X7 0.159 6.28
X8 0.202 4.94
X9 0.195 5.12
ols_vif_tol(model_wf_rm2_log)
Variables Tolerance VIF
X1 0.0181 55.2 
X3 0.0583 17.2 
X4 0.0148 67.4 
X5 0.163  6.13
X6 0.543  1.84
X7 0.114  8.79
X8 0.183  5.46
X9 0.175  5.72
ols_vif_tol(model_wf_rm2_aic_log)
Variables Tolerance VIF
X3 0.332 3.01
X4 0.167 5.97
X6 0.839 1.19
X7 0.159 6.28
X8 0.202 4.94
X9 0.195 5.12
ols_vif_tol(model_wf_rm3_log)
Variables Tolerance VIF
X1 0.0983 10.2 
X2 0.243  4.11
X4 0.113  8.87
X5 0.272  3.68
X6 0.767  1.3 
X7 0.206  4.85
X8 0.19   5.26
X9 0.187  5.36
ols_vif_tol(model_wf_rm3_aic_log)
Variables Tolerance VIF
X1 0.119 8.39
X2 0.313 3.19
X4 0.123 8.12
X5 0.343 2.92
X8 0.209 4.79
X9 0.205 4.88
ols_vif_tol(model_wf_rm4_log)
Variables Tolerance VIF
X1 0.119 8.38
X2 0.209 4.79
X3 0.379 2.64
X5 0.187 5.35
X6 0.836 1.2 
X7 0.165 6.05
X8 0.187 5.35
X9 0.183 5.45
ols_vif_tol(model_wf_rm4_aic_log)
Variables Tolerance VIF
X1 0.279 3.58
X3 0.768 1.3 
X6 0.917 1.09
X7 0.251 3.99
X8 0.202 4.94
X9 0.196 5.1 
ols_vif_tol(model_wf_rm5_log)
Variables Tolerance VIF
X1 0.0269 37.2 
X2 0.21   4.77
X3 0.0836 12   
X4 0.0171 58.4 
X6 0.485  2.06
X7 0.0774 12.9 
X8 0.2    5.01
X9 0.19   5.25
ols_vif_tol(model_wf_rm5_aic_log)
Variables Tolerance VIF
X3 0.332 3.01
X4 0.167 5.97
X6 0.839 1.19
X7 0.159 6.28
X8 0.202 4.94
X9 0.195 5.12
ols_vif_tol(model_wf_rm6_log)
Variables Tolerance VIF
X1 0.0162 61.5 
X2 0.167  6   
X3 0.0563 17.8 
X4 0.0182 54.8 
X5 0.116  8.64
X7 0.0832 12   
X8 0.182  5.49
X9 0.174  5.75
ols_vif_tol(model_wf_rm6_aic_log)
Variables Tolerance VIF
X1 0.119 8.39
X2 0.313 3.19
X4 0.123 8.12
X5 0.343 2.92
X8 0.209 4.79
X9 0.205 4.88
ols_vif_tol(model_wf_rm7_log)
Variables Tolerance VIF
X1 0.0238 42   
X2 0.31   3.22
X3 0.135  7.42
X4 0.0321 31.1 
X5 0.164  6.09
X6 0.74   1.35
X8 0.184  5.45
X9 0.177  5.66
ols_vif_tol(model_wf_rm7_aic_log)
Variables Tolerance VIF
X1 0.119 8.39
X2 0.313 3.19
X4 0.123 8.12
X5 0.343 2.92
X8 0.209 4.79
X9 0.205 4.88
ols_vif_tol(model_wf_rm8_log)
Variables Tolerance VIF
X1 0.0103  97.5 
X2 0.134   7.46
X3 0.0333  30   
X4 0.00973 103   
X5 0.114   8.81
X6 0.435   2.3 
X7 0.0492  20.3 
X9 0.879   1.14
ols_vif_tol(model_wf_rm8_aic_log)
Variables Tolerance VIF
X1 0.281 3.56
X3 0.773 1.29
X6 0.949 1.05
X7 0.252 3.96
ols_vif_tol(model_wf_rm9_log)
Variables Tolerance VIF
X1 0.0104  95.8 
X2 0.134   7.48
X3 0.0341  29.3 
X4 0.00998 100   
X5 0.113   8.83
X6 0.433   2.31
X7 0.0495  20.2 
X8 0.919   1.09
ols_vif_tol(model_wf_rm9_aic_log)
Variables Tolerance VIF
X1 0.28  3.58
X3 0.771 1.3 
X6 0.945 1.06
X7 0.252 3.97
X8 0.963 1.04

(d) (3) Variable selection

# Removing one variable
huxreg(model_wf_rm1_log, model_wf_rm2_log, model_wf_rm3_log, model_wf_rm4_log, model_wf_rm5_log, model_wf_rm6_log, model_wf_rm7_log, model_wf_rm8_log, model_wf_rm9_log, model_wf_full_log)
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10)
(Intercept) 3.280     3.703     7.731 *** 1.200     2.581 *** 5.523     7.487 **  -0.182  -0.127   3.402    
(1.690)    (2.333)    (1.678)    (2.110)    (0.547)    (3.075)    (2.314)    (4.072) (3.844)  (3.150)   
X2 -1.243              6.526     -5.001     -2.144     4.655     8.550     -3.730  -2.980   -1.024    
(5.025)             (5.358)    (5.568)    (5.445)    (6.574)    (4.819)    (9.350) (8.912)  (6.995)   
X3 0.183 *** 0.167 *            0.278 *** 0.201 **  0.046     0.002     0.277  0.287 * 0.178    
(0.034)    (0.080)             (0.032)    (0.067)    (0.088)    (0.057)    (0.146) (0.137)  (0.111)   
X4 0.104 **  0.119     0.286 ***          0.088     0.253 **  0.284 *** 0.027  0.009   0.109    
(0.032)    (0.090)    (0.035)             (0.084)    (0.087)    (0.066)    (0.153) (0.143)  (0.115)   
X5 -0.008     -0.013     -0.055 *   0.013              -0.031     -0.050     0.036  0.031   -0.010    
(0.021)    (0.028)    (0.023)    (0.027)             (0.036)    (0.030)    (0.047) (0.044)  (0.036)   
X6 -0.396 *   -0.375     -0.161     -0.531 **  -0.408              -0.138     -0.335  -0.385   -0.389    
(0.164)    (0.188)    (0.168)    (0.155)    (0.199)             (0.174)    (0.289) (0.276)  (0.216)   
X7 4.317 **  3.975 *   0.959     6.088 *** 4.611 *   1.517              5.230  5.352   4.233    
(1.466)    (1.495)    (1.178)    (1.266)    (1.814)    (1.884)             (3.124) (2.965)  (2.339)   
X8 0.629 *** 0.632 *** 0.680 *** 0.606 *** 0.618 *** 0.614 *** 0.657 ***       0.125   0.630 ***
(0.142)    (0.145)    (0.151)    (0.147)    (0.139)    (0.157)    (0.156)          (0.085)  (0.149)   
X9 -0.461 *** -0.464 *** -0.513 *** -0.436 **  -0.453 *** -0.461 **  -0.490 *** 0.001         -0.462 ** 
(0.116)    (0.119)    (0.122)    (0.119)    (0.114)    (0.129)    (0.128)    (0.073)        (0.122)   
X1          -0.042     -0.457 *** 0.250 **  0.048     -0.345     -0.418 *   0.242  0.255   -0.014    
         (0.210)    (0.096)    (0.083)    (0.173)    (0.239)    (0.197)    (0.383) (0.362)  (0.292)   
N 30         30         30         30         30         30         30         30      30       30        
R2 0.947     0.947     0.941     0.945     0.947     0.939     0.939     0.900  0.910   0.947    
logLik -11.407     -11.422     -13.216     -12.059     -11.458     -13.667     -13.681     -20.968  -19.486   -11.406    
AIC 42.815     42.843     46.432     44.118     42.916     47.333     47.361     61.935  58.972   44.811    
*** p < 0.001; ** p < 0.01; * p < 0.05.
# The results of elimination regression
huxreg(model_wf_rm1_aic_log, model_wf_rm2_aic_log, model_wf_rm3_aic_log, model_wf_rm4_aic_log, model_wf_rm5_aic_log, model_wf_rm6_aic_log, model_wf_rm7_aic_log, model_wf_rm8_aic_log, model_wf_rm9_aic_log, model_wf_aic_log)
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10)
(Intercept) 2.692 *** 2.692 *** 6.882 *** 2.307 *** 2.692 *** 6.882 *** 6.882 *** 2.587 *** 2.225 *** 2.692 ***
(0.445)    (0.445)    (1.432)    (0.410)    (0.445)    (1.432)    (1.432)    (0.494)    (0.515)    (0.445)   
X3 0.184 *** 0.184 ***          0.263 *** 0.184 ***                   0.266 *** 0.268 *** 0.184 ***
(0.032)    (0.032)             (0.022)    (0.032)                      (0.029)    (0.028)    (0.032)   
X4 0.109 *** 0.109 *** 0.294 ***          0.109 *** 0.294 *** 0.294 ***                   0.109 ***
(0.026)    (0.026)    (0.033)             (0.026)    (0.033)    (0.033)                      (0.026)   
X6 -0.368 *   -0.368 *            -0.532 **  -0.368 *                     -0.416 *   -0.435 *   -0.368 *  
(0.146)    (0.146)             (0.144)    (0.146)                      (0.186)    (0.179)    (0.146)   
X7 4.085 **  4.085 **           5.453 *** 4.085 **                    5.209 *** 5.174 *** 4.085 ** 
(1.213)    (1.213)             (1.002)    (1.213)                      (1.310)    (1.256)    (1.213)   
X8 0.612 *** 0.612 *** 0.629 *** 0.613 *** 0.612 *** 0.629 *** 0.629 ***          0.141     0.612 ***
(0.133)    (0.133)    (0.142)    (0.137)    (0.133)    (0.142)    (0.142)             (0.079)    (0.133)   
X9 -0.448 *** -0.448 *** -0.471 *** -0.433 *** -0.448 *** -0.471 *** -0.471 ***                   -0.448 ***
(0.108)    (0.108)    (0.115)    (0.112)    (0.108)    (0.115)    (0.115)                      (0.108)   
X1                   -0.432 *** 0.207 ***          -0.432 *** -0.432 *** 0.208 **  0.199 **          
                  (0.086)    (0.053)             (0.086)    (0.086)    (0.070)    (0.067)            
X2                   8.217                       8.217     8.217                               
                  (4.655)                      (4.655)    (4.655)                              
X5                   -0.043 *                     -0.043 *   -0.043 *                             
                  (0.020)                      (0.020)    (0.020)                              
N 30         30         30         30         30         30         30         30         30         30        
R2 0.947     0.947     0.937     0.943     0.947     0.937     0.937     0.893     0.906     0.947    
logLik -11.581     -11.581     -14.144     -12.652     -11.581     -14.144     -14.144     -22.024     -20.155     -11.581    
AIC 39.161     39.161     44.288     41.305     39.161     44.288     44.288     56.049     54.311     39.161    
*** p < 0.001; ** p < 0.01; * p < 0.05.

(d) (4) Forward selection

Stepwise Forward Regression for full model

# Stepwise Forward Regression based on p values (use a=0.15) #
ols_step_forward_p(model_wf_full_log, penter = 0.15)
## Forward Selection Method    
## ---------------------------
## 
## Candidate Terms: 
## 
## 1. X1 
## 2. X2 
## 3. X3 
## 4. X4 
## 5. X5 
## 6. X6 
## 7. X7 
## 8. X8 
## 9. X9 
## 
## We are selecting variables based on p value...
## 
## Variables Entered: 
## 
## - X4 
## - X3 
## - X7 
## 
## No more variables to be added.
## 
## Final Model Output 
## ------------------
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.944       RMSE               0.549 
## R-Squared               0.890       Coef. Var          8.618 
## Adj. R-Squared          0.878       MSE                0.301 
## Pred R-Squared          0.854       MAE                0.414 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                
## -------------------------------------------------------------------
##                Sum of                                              
##               Squares        DF    Mean Square      F         Sig. 
## -------------------------------------------------------------------
## Regression     63.565         3         21.188    70.378    0.0000 
## Residual        7.828        26          0.301                     
## Total          71.393        29                                    
## -------------------------------------------------------------------
## 
##                                  Parameter Estimates                                  
## -------------------------------------------------------------------------------------
##       model     Beta    Std. Error    Std. Beta      t       Sig      lower    upper 
## -------------------------------------------------------------------------------------
## (Intercept)    2.872         0.547                 5.254    0.000     1.748    3.995 
##          X4    0.122         0.033        0.559    3.730    0.001     0.055    0.189 
##          X3    0.168         0.040        0.435    4.165    0.000     0.085    0.251 
##          X7    3.106         1.537        0.309    2.021    0.054    -0.053    6.266 
## -------------------------------------------------------------------------------------
## 
##                            Selection Summary                             
## ------------------------------------------------------------------------
##         Variable                  Adj.                                      
## Step    Entered     R-Square    R-Square     C(p)        AIC       RMSE     
## ------------------------------------------------------------------------
##    1    X4            0.8030      0.7960    48.8552    68.4060    0.7087    
##    2    X3            0.8731      0.8637    24.2129    57.2082    0.5792    
##    3    X7            0.8904      0.8777    19.6668    54.8305    0.5487    
## ------------------------------------------------------------------------
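As a quick arithmetic check on the output above, the R-Squared and Adj. R-Squared in the Model Summary follow directly from the ANOVA table. A minimal sketch (in Python, purely for illustration, with the sums of squares copied from the printout; n = 30, k = 3 predictors):

```python
# Illustrative check: R-squared from the printed ANOVA decomposition.
ss_regression = 63.565   # Regression Sum of Squares from the table above
ss_total = 71.393        # Total Sum of Squares

r_squared = ss_regression / ss_total
# Adjusted R-squared penalizes for the k = 3 predictors
adj_r_squared = 1 - (1 - r_squared) * (30 - 1) / (30 - 3 - 1)

print(round(r_squared, 3), round(adj_r_squared, 3))
```

Both values agree with the 0.890 and 0.878 reported in the Model Summary.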
# Stepwise AIC Forward Regression #
ols_step_forward_aic(model_wf_full_log)
## Forward Selection Method 
## ------------------------
## 
## Candidate Terms: 
## 
## 1 . X1 
## 2 . X2 
## 3 . X3 
## 4 . X4 
## 5 . X5 
## 6 . X6 
## 7 . X7 
## 8 . X8 
## 9 . X9 
## 
## 
## Variables Entered: 
## 
## - X4 
## - X3 
## - X7 
## - X8 
## - X9 
## - X6 
## 
## No more variables to be added.
## 
##                        Selection Summary                        
## ---------------------------------------------------------------
## Variable      AIC      Sum Sq     RSS       R-Sq      Adj. R-Sq 
## ---------------------------------------------------------------
## X4           68.406    57.330    14.063    0.80302      0.79599 
## X3           57.208    62.335     9.057    0.87313      0.86373 
## X7           54.830    63.565     7.828    0.89036      0.87771 
## X8           54.522    64.144     7.248    0.89848      0.88223 
## X9           44.504    66.537     4.856    0.93199      0.91782 
## X6           39.161    67.591     3.801    0.94675      0.93286 
## ---------------------------------------------------------------
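The AIC column above matches R's definition for Gaussian `lm` fits: AIC = -2·logLik + 2p, where logLik = -(n/2)(log 2π + log(RSS/n) + 1) and p counts the regression coefficients plus the error variance. A sketch (Python, for illustration) reproducing the first and last rows from their RSS values:

```python
import math

def lm_aic(rss, n, n_coef):
    # log-likelihood of a Gaussian linear model as R computes it for
    # lm objects; the sigma^2 estimate counts as one extra parameter.
    loglik = -n / 2 * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    return -2 * loglik + 2 * (n_coef + 1)

aic_x4 = lm_aic(rss=14.063, n=30, n_coef=2)    # intercept + X4
aic_step6 = lm_aic(rss=3.801, n=30, n_coef=7)  # intercept + 6 predictors
print(round(aic_x4, 2), round(aic_step6, 2))
```

These reproduce the 68.406 and 39.161 in the selection summary (to rounding of the printed RSS).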

Stepwise Forward Regression for X4 eliminated model

# Stepwise Forward Regression based on p values (use a=0.15) #
ols_step_forward_p(model_wf_rm4_log, penter = 0.15)
# Stepwise AIC Forward Regression #
ols_step_forward_aic(model_wf_rm4_log)

Stepwise Forward Regression for X1 eliminated model

# Stepwise Forward Regression based on p values (use a=0.15) #
ols_step_forward_p(model_wf_rm1_log, penter = 0.15)
# Stepwise AIC Forward Regression #
ols_step_forward_aic(model_wf_rm1_log)

(d) (5) Backward selection

Stepwise Backward Regression for full model

# Stepwise Backward Regression based on p values (use a=0.05) #
ols_step_backward_p(model_wf_full_log, prem = 0.05) # backward elimination uses the removal threshold prem, not penter #
## Backward Elimination Method 
## ---------------------------
## 
## Candidate Terms: 
## 
## 1 . X1 
## 2 . X2 
## 3 . X3 
## 4 . X4 
## 5 . X5 
## 6 . X6 
## 7 . X7 
## 8 . X8 
## 9 . X9 
## 
## We are eliminating variables based on p value...
## 
## Variables Removed: 
## 
## - X1 
## - X2 
## - X5 
## 
## No more variables satisfy the condition of p value = 0.3
## 
## 
## Final Model Output 
## ------------------
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.973       RMSE               0.407 
## R-Squared               0.947       Coef. Var          6.385 
## Adj. R-Squared          0.933       MSE                0.165 
## Pred R-Squared          0.908       MAE                0.273 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                
## -------------------------------------------------------------------
##                Sum of                                              
##               Squares        DF    Mean Square      F         Sig. 
## -------------------------------------------------------------------
## Regression     67.591         6         11.265     68.16    0.0000 
## Residual        3.801        23          0.165                     
## Total          71.393        29                                    
## -------------------------------------------------------------------
## 
##                                   Parameter Estimates                                    
## ----------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
## ----------------------------------------------------------------------------------------
## (Intercept)     2.692         0.445                  6.046    0.000     1.771     3.613 
##          X3     0.184         0.032        0.476     5.698    0.000     0.117     0.251 
##          X4     0.109         0.026        0.499     4.244    0.000     0.056     0.162 
##          X6    -0.368         0.146       -0.133    -2.526    0.019    -0.669    -0.066 
##          X7     4.085         1.213        0.406     3.367    0.003     1.575     6.595 
##          X8     0.612         0.133        0.493     4.614    0.000     0.337     0.886 
##          X9    -0.448         0.108       -0.450    -4.135    0.000    -0.672    -0.224 
## ----------------------------------------------------------------------------------------
## 
## 
##                           Elimination Summary                           
## -----------------------------------------------------------------------
##         Variable                  Adj.                                     
## Step    Removed     R-Square    R-Square     C(p)       AIC       RMSE     
## -----------------------------------------------------------------------
##    1    X1            0.9474      0.9273    8.0021    42.8146    0.4230    
##    2    X2            0.9472      0.9304    6.0604    40.9019    0.4139    
##    3    X5            0.9468      0.9329    4.2345    39.1611    0.4065    
## -----------------------------------------------------------------------
# Stepwise AIC Backward Regression #
ols_step_backward_aic(model_wf_full_log)
## Backward Elimination Method 
## ---------------------------
## 
## Candidate Terms: 
## 
## 1 . X1 
## 2 . X2 
## 3 . X3 
## 4 . X4 
## 5 . X5 
## 6 . X6 
## 7 . X7 
## 8 . X8 
## 9 . X9 
## 
## 
## Variables Removed: 
## 
## - X1 
## - X2 
## - X5 
## 
## No more variables to be removed.
## 
## 
##                   Backward Elimination Summary                   
## ---------------------------------------------------------------
## Variable       AIC       RSS     Sum Sq     R-Sq      Adj. R-Sq 
## ---------------------------------------------------------------
## Full Model    44.811    3.757    67.635    0.94737      0.92369 
## X1            42.815    3.758    67.635    0.94737      0.92731 
## X2            40.902    3.769    67.624    0.94721      0.93042 
## X5            39.161    3.801    67.591    0.94675      0.93286 
## ---------------------------------------------------------------

Stepwise Backward Regression for X4 eliminated model

# Stepwise Backward Regression based on p values (use a=0.05) #
ols_step_backward_p(model_wf_rm4_log, prem = 0.05)
# Stepwise AIC Backward Regression #
ols_step_backward_aic(model_wf_rm4_log)

Stepwise Backward Regression for X1 eliminated model

# Stepwise Backward Regression based on p values (use a=0.05) #
ols_step_backward_p(model_wf_rm1_log, prem = 0.05)
# Stepwise AIC Backward Regression #
ols_step_backward_aic(model_wf_rm1_log)

(d) (6) Best Subset Regression

# For full model #
k <- ols_step_best_subset(model_wf_full_log)
k
mindex n predictors rsquare adjr predrsq cp aic sbic sbc msep fpe apc hsp
1 1 X4 0.803 0.796 0.772 48.9  68.4 -19.8 72.6 0.538 0.536 0.225  0.0186 
2 2 X3 X4 0.873 0.864 0.844 24.2  57.2 -30.5 62.8 0.373 0.369 0.155  0.0129 
3 3 X3 X4 X7 0.89  0.878 0.854 19.7  54.8 -32.7 61.8 0.348 0.341 0.143  0.012  
4 4 X1 X4 X8 X9 0.921 0.908 0.886 10.1  47   -38.1 55.4 0.272 0.264 0.111  0.00941
5 5 X3 X4 X7 X8 X9 0.932 0.918 0.892 7.85 44.5 -38.8 54.3 0.255 0.243 0.102  0.0088 
6 6 X3 X4 X6 X7 X8 X9 0.947 0.933 0.908 4.23 39.2 -39.7 50.4 0.217 0.204 0.0857 0.00751
7 7 X3 X4 X5 X6 X7 X8 X9 0.947 0.93  0.902 6.06 40.9 -36.8 53.5 0.236 0.217 0.0912 0.00816
8 8 X2 X3 X4 X5 X6 X7 X8 X9 0.947 0.927 0.896 8    42.8 -33.8 56.8 0.259 0.233 0.0977 0.00895
9 9 X1 X2 X3 X4 X5 X6 X7 X8 X9 0.947 0.924 0.886 10    44.8 -30.8 60.2 0.286 0.25  0.105  0.00989
plot(k)
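The `cp` column can be reproduced from the residual sums of squares: Cp = SSE_p/MSE_full − (n − 2p), with p the number of coefficients (intercept included) in the candidate model. An illustrative check in Python for the six-variable subset, with numbers copied from the tables above:

```python
# Illustrative check of the Mallows Cp column.
n = 30
mse_full = 3.757 / (30 - 10)   # full model: 9 predictors + intercept, RSS = 3.757
sse_6 = 3.801                  # subset X3 X4 X6 X7 X8 X9, p = 7 coefficients
cp_6 = sse_6 / mse_full - (n - 2 * 7)
print(round(cp_6, 2))
```

This matches the 4.23 shown for the six-variable row.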

# For X4 eliminated model #
# k <- ols_step_best_subset(model_wf_rm4_log)
# k
# plot(k)

# For X1 eliminated model #
# k <- ols_step_best_subset(model_wf_rm1_log)
# k
# plot(k)

(d) (7) Models Comparison

  • Model 437896
# build model 437896
model_wf_437896_log <- lm(log(y) ~ X4 + X3 + X7 + X8 + X9 + X6, data=table_wf)
ols_regress(model_wf_437896_log)
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.973       RMSE               0.407 
## R-Squared               0.947       Coef. Var          6.385 
## Adj. R-Squared          0.933       MSE                0.165 
## Pred R-Squared          0.908       MAE                0.273 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                
## -------------------------------------------------------------------
##                Sum of                                              
##               Squares        DF    Mean Square      F         Sig. 
## -------------------------------------------------------------------
## Regression     67.591         6         11.265     68.16    0.0000 
## Residual        3.801        23          0.165                     
## Total          71.393        29                                    
## -------------------------------------------------------------------
## 
##                                   Parameter Estimates                                    
## ----------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
## ----------------------------------------------------------------------------------------
## (Intercept)     2.692         0.445                  6.046    0.000     1.771     3.613 
##          X4     0.109         0.026        0.499     4.244    0.000     0.056     0.162 
##          X3     0.184         0.032        0.476     5.698    0.000     0.117     0.251 
##          X7     4.085         1.213        0.406     3.367    0.003     1.575     6.595 
##          X8     0.612         0.133        0.493     4.614    0.000     0.337     0.886 
##          X9    -0.448         0.108       -0.450    -4.135    0.000    -0.672    -0.224 
##          X6    -0.368         0.146       -0.133    -2.526    0.019    -0.669    -0.066 
## ----------------------------------------------------------------------------------------
Anova(model_wf_437896_log)
            Sum Sq Df F value   Pr(>F)
X4            2.98  1    18.0  0.000306
X3            5.37  1    32.5  8.41e-06
X7            1.87  1    11.3  0.00266
X8            3.52  1    21.3  0.000122
X9            2.83  1    17.1  0.000402
X6            1.05  1    6.38  0.0189
Residuals     3.8  23
confint(model_wf_437896_log, level = 1-(0.05/7)) # Bonferroni joint confidence interval #
##                 0.357 %    99.643 %
## (Intercept)  1.37732232  4.00627202
## X4           0.03318876  0.18491189
## X3           0.08857700  0.27911109
## X7           0.50312634  7.66681371
## X8           0.22022202  1.00298907
## X9          -0.76727751 -0.12799849
## X6          -0.79716475  0.06213151
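The interval bounds printed by `confint()` reflect the Bonferroni split of α across the 7 coefficients: each interval is computed at level 1 − 0.05/7, i.e. at tail percentiles of 0.357% and 99.643%. The arithmetic (Python, for illustration):

```python
# Bonferroni joint confidence intervals: to keep the family-wise level at
# 95% across m = 7 coefficients, each individual interval uses alpha / m.
alpha, m = 0.05, 7
per_interval_level = 1 - alpha / m      # level passed to confint() above
tail = alpha / m / 2 * 100              # lower percentile printed by confint()
print(round(per_interval_level, 5), round(tail, 3))
```

The 0.357 here matches the "0.357 %" column header in the output.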
# Collinearity Diagnostics #
ols_vif_tol(model_wf_437896_log)
Variables Tolerance VIF
X4 0.167 5.97
X3 0.332 3.01
X7 0.159 6.28
X8 0.202 4.94
X9 0.195 5.12
X6 0.839 1.19
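Each VIF above equals 1/(1 − R²_j), where R²_j comes from regressing predictor j on the remaining predictors; Tolerance is its reciprocal. A self-contained sketch (Python, with hypothetical toy data) for the two-predictor case, where the auxiliary regression reduces to a simple correlation:

```python
# With exactly two predictors, each auxiliary regression is a simple
# regression, so VIF_1 = VIF_2 = 1 / (1 - r^2) with r = corr(x1, x2).
# Toy data (hypothetical): x2 tracks x1 closely, so VIF is large.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.1, 7.9, 10.1]

n = len(x1)
mean1, mean2 = sum(x1) / n, sum(x2) / n
cov = sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2))
var1 = sum((a - mean1) ** 2 for a in x1)
var2 = sum((b - mean2) ** 2 for b in x2)
r2 = cov ** 2 / (var1 * var2)   # R^2 of the auxiliary regression
vif = 1 / (1 - r2)
tolerance = 1 - r2
print(round(r2, 4), round(vif, 1), round(tolerance, 4))
```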
# Model Fit Assessment
ols_plot_diagnostics(model_wf_437896_log)

# Residual Normality Correlation Test
ols_test_correlation(model_wf_437896_log) # Correlation between observed residuals and the residuals expected under normality; values near 1 support normality. #
## [1] 0.9837263
# Residual Normality Test
ols_test_normality(model_wf_437896_log) # Tests for violation of the normality assumption; a large p-value indicates no evidence of non-normality. #
## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.9728         0.6175 
## Kolmogorov-Smirnov        0.0997         0.8982 
## Cramer-von Mises          4.8429         0.0000 
## Anderson-Darling          0.2996         0.5612 
## -----------------------------------------------
# Variable Contributions
ols_plot_added_variable(model_wf_437896_log)

# Residual Plus Component Plot
ols_plot_comp_plus_resid(model_wf_437896_log)

  • Model 437
# build model 437
model_wf_437_log <- lm(log(y) ~ X4 + X3 + X7, data=table_wf)
ols_regress(model_wf_437_log)
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.944       RMSE               0.549 
## R-Squared               0.890       Coef. Var          8.618 
## Adj. R-Squared          0.878       MSE                0.301 
## Pred R-Squared          0.854       MAE                0.414 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                
## -------------------------------------------------------------------
##                Sum of                                              
##               Squares        DF    Mean Square      F         Sig. 
## -------------------------------------------------------------------
## Regression     63.565         3         21.188    70.378    0.0000 
## Residual        7.828        26          0.301                     
## Total          71.393        29                                    
## -------------------------------------------------------------------
## 
##                                  Parameter Estimates                                  
## -------------------------------------------------------------------------------------
##       model     Beta    Std. Error    Std. Beta      t       Sig      lower    upper 
## -------------------------------------------------------------------------------------
## (Intercept)    2.872         0.547                 5.254    0.000     1.748    3.995 
##          X4    0.122         0.033        0.559    3.730    0.001     0.055    0.189 
##          X3    0.168         0.040        0.435    4.165    0.000     0.085    0.251 
##          X7    3.106         1.537        0.309    2.021    0.054    -0.053    6.266 
## -------------------------------------------------------------------------------------
# Collinearity Diagnostics #
ols_vif_tol(model_wf_437_log)
Variables Tolerance VIF
X4 0.188 5.32
X3 0.386 2.59
X7 0.181 5.53
# Model Fit Assessment
ols_plot_diagnostics(model_wf_437_log)

# Residual Normality Correlation Test
ols_test_correlation(model_wf_437_log) # Correlation between observed residuals and the residuals expected under normality; values near 1 support normality. #
## [1] 0.9856766
# Residual Normality Test
ols_test_normality(model_wf_437_log) # Tests for violation of the normality assumption; a large p-value indicates no evidence of non-normality. #
## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.9765         0.7267 
## Kolmogorov-Smirnov        0.1033         0.8736 
## Cramer-von Mises          3.1908         0.0000 
## Anderson-Darling          0.3511         0.4469 
## -----------------------------------------------
# Variable Contributions
ols_plot_added_variable(model_wf_437_log)

# Residual Plus Component Plot
ols_plot_comp_plus_resid(model_wf_437_log)

# Check PRESS Statistic
ols_press(model_wf_full)
## [1] 15880486
ols_press(model_wf_full_log)
## [1] 8.136733
ols_press(model_wf_437896_log)
## [1] 6.538275
ols_press(model_wf_437_log)
## [1] 10.43262
# ols_press(model_wf_137689_log)
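The PRESS statistic compared above is the sum of squared leave-one-out prediction errors, computable without refitting via PRESS = Σ (e_i / (1 − h_ii))². A sketch (Python, hypothetical toy data) checking the hat-matrix shortcut against explicit leave-one-out refits for a simple regression:

```python
# PRESS via the hat-matrix shortcut vs. explicit leave-one-out refits.
# Toy simple-regression data (hypothetical).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.1, 5.8]

def ols_fit(x, y):
    # Returns (intercept, slope) of a simple OLS fit.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - b1 * mx, b1

b0, b1 = ols_fit(xs, ys)
n = len(xs)
mx = sum(xs) / n
sxx = sum((a - mx) ** 2 for a in xs)

# Shortcut: leverage of a simple regression is h_ii = 1/n + (x - x_bar)^2 / Sxx
press_hat = sum(
    ((y - (b0 + b1 * x)) / (1 - (1 / n + (x - mx) ** 2 / sxx))) ** 2
    for x, y in zip(xs, ys)
)

# Explicit leave-one-out refits give the same value
press_loo = 0.0
for i in range(n):
    a0, a1 = ols_fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    press_loo += (ys[i] - (a0 + a1 * xs[i])) ** 2

print(round(press_hat, 6), round(press_loo, 6))
```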

# prediction power
ols_pred_rsq(model_wf_full)
## [1] 0.6179649
ols_pred_rsq(model_wf_full_log)
## [1] 0.8860283
ols_pred_rsq(model_wf_437896_log)
## [1] 0.908418
ols_pred_rsq(model_wf_437_log)
## [1] 0.8538697
# ols_pred_rsq(model_wf_137689_log)
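The predicted R² values above follow from PRESS as pred R² = 1 − PRESS/SS_total, with SS_total = 71.393 taken from the ANOVA tables. A quick arithmetic check (Python, for illustration):

```python
# Predicted R^2 = 1 - PRESS / SS_total, using the PRESS values above.
ss_total = 71.393
pred_r2_437896 = 1 - 6.538275 / ss_total   # model 437896
pred_r2_437 = 1 - 10.43262 / ss_total      # model 437
print(round(pred_r2_437896, 4), round(pred_r2_437, 4))
```

These reproduce the 0.908418 and 0.8538697 returned by `ols_pred_rsq()`.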

More

  • Other Models
# build X4-eliminated log model with X1*X8 interaction
model_wf_18rm4_log <- lm(log(y) ~ X1*X8 + X3 + X6 + X7 + X9, data=table_wf)

# build log model using the X1*X8 product as a single term
table_wf_resi <- table_wf%>% mutate(x1t8=X1*X8)
model_wf_1time8_log <- lm(log(y) ~ x1t8 + X3 + X6 + X7+ X9 , data=table_wf_resi)

# build log model using the X1*X4 product as a single term
table_wf_resi <- table_wf%>% mutate(x1t4=X1*X4)
model_wf_1time4_log <- lm(log(y) ~ x1t4 + X3 + X6 + X7+ X8+ X9, data=table_wf_resi)
summary(model_wf_1time4_log)

# build log model using the X1/X4 ratio term
table_wf_resi <- table_wf%>% mutate(x14=X1/X4)
model_wf_1per4_log <- lm(log(y) ~ x14 + X3 + X6 + X7+ X8+ X9, data=table_wf_resi)

# build X1-eliminated log model with X4*X3 interaction
model_wf_43rm1_log <- lm(log(y) ~ X9 + X4*X3 + X6 + X7 + X8 , data=table_wf)

# build X1-eliminated log model with X4*X9 interaction
model_wf_49rm1_log <- lm(log(y) ~ X3 + X4*X9 + X6 + X7 + X8 , data=table_wf)

# build X1-eliminated log model with X4*X8 interaction
model_wf_48rm1_log <- lm(log(y) ~ X3 + X4*X8 + X6 + X7 + X9 , data=table_wf)

# build X1-eliminated log model with X4*X7 interaction
model_wf_47rm1_log <- lm(log(y) ~ X3 + X4*X7 + X6 + X9 + X8 , data=table_wf)

# build log model using the X4/X9 ratio term
table_wf_resi <- table_wf%>% mutate(x4p9=X4/X9)
model_wf_4per9_log <- lm(log(y) ~ X3 + x4p9 + X6 + X7 + X8 , data=table_wf_resi)

# build log model with X3*X4 and X8*X9 interactions
model_wf_34v89_log <- lm(log(y) ~ X3*X4 + X8*X9 + X6 + X7, data=table_wf_resi)

# build log model with X3*X4, X8*X9, and X6*X7 interactions
model_wf_34v89v67_log <- lm(log(y) ~ X3*X4 + X8*X9 + X6*X7, data=table_wf_resi)

# build log model with the X8/X9 ratio and X4*X3 interaction
table_wf_resi <- table_wf%>% mutate(x8p9=X8/X9)
model_wf_8per9v43_log <- lm(log(y) ~ x8p9 + X4*X3 + X6 + X7, data=table_wf_resi)

# build log model with the X6/X7 and X8/X9 ratios and X4*X3 interaction
table_wf_resi <- table_wf%>% mutate(x8p9=X8/X9,x6p7=X6/X7)
model_wf_6p7v8p9v43_log <- lm(log(y) ~ x8p9 + X4*X3 + x6p7, data=table_wf_resi)

# build log model with the X8/X9 ratio and X4*X3 interaction, X7 removed
table_wf_resi <- table_wf%>% mutate(x8p9=X8/X9)
model_wf_8per9v43rm7_log <- lm(log(y) ~ x8p9 + X4*X3 + X6, data=table_wf_resi)

huxreg(model_wf_8per9v43rm7_log, model_wf_8per9v43_log, model_wf_43rm1_log, model_wf_6p7v8p9v43_log, model_wf_34v89_log, model_wf_34v89v67_log)
  • Interaction models
# Interaction regression for full interaction
model_wf_full_log_inter <- lm(log(y)~ (X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9)^2, data=table_wf)
  • All log models
# build all log model
model_wf_all_log <- lm(log(y) ~ log(X1) + log(X2) + log(X3) + log(X4) + log(X5) + log(X6) + log(X7) + log(X8) + log(X9), data=table_wf)
ols_vif_tol(model_wf_all_log)
Variables Tolerance VIF
log(X1) 0.00608 164   
log(X2) 0.0865  11.6 
log(X3) 0.0816  12.3 
log(X4) 0.00767 130   
log(X5) 0.0885  11.3 
log(X6) 0.421   2.37
log(X7) 0.108   9.29
log(X8) 0.193   5.18
log(X9) 0.187   5.35
model_wf_all_log_inter <- lm(log(y) ~ (log(X1) + log(X2) + log(X3) + log(X4) + log(X5) + log(X6) + log(X7) + log(X8) + log(X9))^2, data=table_wf)
model_wf_rm2457_all_log <- lm(log(y) ~ log(X1) + log(X3) + log(X6) + log(X8) + log(X9), data=table_wf)
model_wf_rm257_all_log <- lm(log(y) ~ log(X1) + log(X3) + log(X4) + log(X6) + log(X8) + log(X9), data=table_wf)
model_wf_rm247_all_log <- lm(log(y) ~ log(X1) + log(X3) + log(X5) + log(X6) + log(X8) + log(X9), data=table_wf)
model_wf_rm125_all_log <- lm(log(y) ~ log(X3) + log(X4) + log(X6) + log(X7) + log(X8) + log(X9), data=table_wf)
ols_vif_tol(model_wf_aic_all_log)
Variables Tolerance VIF
log(X1) 0.263 3.8 
log(X3) 0.603 1.66
log(X5) 0.22  4.55
log(X6) 0.71  1.41
log(X8) 0.201 4.99
log(X9) 0.191 5.22
# Comparison
huxreg(model_wf_aic_log, model_wf_aic_all_log,model_wf_rm2457_all_log, model_wf_rm247_all_log, model_wf_rm257_all_log, model_wf_rm125_all_log, model_wf_aic_log_inter, model_wf_aic_all_log_inter)
(1) (2) (3) (4) (5) (6) (7) (8)
(Intercept) 2.692 *** 0.571     5.876 *** 0.571     5.297 *** 4.198 *** -0.981     -16.027   
(0.445)    (3.360)    (0.220)    (3.360)    (0.492)    (0.226)    (1.520)    (13.947)  
X3 0.184 ***                                              0.493 ***        
(0.032)                                                 (0.066)           
X4 0.109 ***                                              0.168 **         
(0.026)                                                 (0.054)           
X6 -0.368 *                                                                
(0.146)                                                                 
X7 4.085 **                                               2.515            
(1.213)                                                 (1.258)           
X8 0.612 ***                                              0.586 ***        
(0.133)                                                 (0.087)           
X9 -0.448 ***                                              -0.248 *          
(0.108)                                                 (0.105)           
log(X1)          0.726 *** 0.681 *** 0.726 *** 0.463 *                     1.158   
         (0.036)    (0.023)    (0.036)    (0.168)                      (0.697)  
log(X3)          0.419 *** 0.375 *** 0.419 *** 0.249     0.285              1.666   
         (0.096)    (0.095)    (0.096)    (0.134)    (0.205)             (1.763)  
log(X5)          1.259              1.259                                5.028   
         (0.796)             (0.796)                               (3.473)  
log(X6)          -0.267 **  -0.336 *** -0.267 **  -0.326 *** -0.366 ***          -0.158   
         (0.090)    (0.081)    (0.090)    (0.081)    (0.096)             (0.300)  
log(X8)          1.623 *** 1.725 *** 1.623 *** 1.715 *** 1.676 ***          44.919   
         (0.175)    (0.168)    (0.175)    (0.165)    (0.180)             (24.175)  
log(X9)          -1.375 *** -1.457 *** -1.375 *** -1.450 *** -1.409 ***          -37.066   
         (0.154)    (0.150)    (0.154)    (0.148)    (0.161)             (19.170)  
log(X4)                                     0.417     1.130 ***          -2.394   
                                    (0.318)    (0.107)             (2.963)  
log(X7)                                              0.356              -0.408   
                                             (0.220)             (0.518)  
X1                                                       2.712 ***        
                                                      (0.330)           
X2                                                       -24.452 ***        
                                                      (5.631)           
X5                                                       0.039 *          
                                                      (0.016)           
X1:X3                                                       -0.416 ***        
                                                      (0.045)           
X1:X9                                                       -0.118 **         
                                                      (0.036)           
X2:X9                                                       4.643 **         
                                                      (1.449)           
X3:X8                                                       -0.051 **         
                                                      (0.014)           
X4:X8                                                       0.071 ***        
                                                      (0.015)           
X7:X9                                                       -0.995 *          
                                                      (0.344)           
log(X2)                                                                -0.733   
                                                               (0.338)  
log(X1):log(X3)                                                                0.823   
                                                               (0.707)  
log(X1):log(X8)                                                                1.242   
                                                               (0.641)  
log(X1):log(X9)                                                                -1.000   
                                                               (0.674)  
log(X2):log(X8)                                                                1.089 * 
                                                               (0.397)  
log(X2):log(X9)                                                                -0.710   
                                                               (0.341)  
log(X3):log(X9)                                                                0.411   
                                                               (0.356)  
log(X4):log(X8)                                                                -3.029 **
                                                               (0.849)  
log(X4):log(X9)                                                                2.174   
                                                               (0.929)  
log(X5):log(X8)                                                                -7.969   
                                                               (5.562)  
log(X5):log(X9)                                                                6.795   
                                                               (4.450)  
log(X6):log(X8)                                                                0.403   
                                                               (0.289)  
log(X6):log(X9)                                                                -0.506   
                                                               (0.265)  
log(X7):log(X9)                                                                0.464   
                                                               (0.376)  
N 30         30         30         30         30         30         30         30       
R2 0.947     0.986     0.984     0.986     0.986     0.983     0.994     0.998   
logLik -11.581     8.464     6.915     8.464     7.995     5.333     20.797     41.586   
AIC 39.161     -0.929     0.170     -0.929     0.009     5.335     -9.594     -35.172   
*** p < 0.001; ** p < 0.01; * p < 0.05.
  • Mixed models
# Mixed regression 1
model_wf_mix1 <- lm(log(y) ~ (X1 + X3 + X4)^2 + log(X8) + log(X9), data = table_wf)

# Mixed regression 2
model_wf_mix2 <- lm(log(y) ~ (log(X1) + log(X3) + log(X4))^2 + log(X8) + log(X9), data = table_wf)

# Mixed regression 3
model_wf_mix3 <- lm(log(y) ~ (X1 + X3 + X4)^2 + log(X2) + log(X5) + log(X6) + log(X7) + log(X8) + log(X9), data = table_wf)
# Comparison
huxreg(model_wf_aic_mix1, model_wf_aic_mix2, model_wf_aic_mix3)
(1) (2) (3)
(Intercept) 2.598 *** 3.149 *** 2.044 ***
(0.228)    (0.738)    (0.343)   
X1 1.290 ***          1.459 ***
(0.232)             (0.232)   
X3 0.296 ***          0.275 ***
(0.040)             (0.039)   
X4 0.391 ***          0.423 ***
(0.042)             (0.042)   
log(X8) 1.575 *** 1.576 *** 1.628 ***
(0.172)    (0.168)    (0.163)   
log(X9) -1.345 *** -1.346 *** -1.405 ***
(0.154)    (0.150)    (0.147)   
X1:X3 -0.381 ***          -0.398 ***
(0.056)             (0.053)   
X1:X4 0.031 *            0.026 *  
(0.012)             (0.012)   
log(X1)          0.093             
         (0.194)            
log(X3)          -1.261 **          
         (0.436)            
log(X4)          3.081 ***         
         (0.783)            
log(X1):log(X3)          -0.635 **          
         (0.177)            
log(X7)                   -0.428    
                  (0.208)   
N 30         30         30        
R2 0.984     0.984     0.987    
logLik 6.624     6.622     9.375    
AIC 4.752     2.757     1.250    
*** p < 0.001; ** p < 0.01; * p < 0.05.
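The logLik and AIC rows in the comparison above are linked by AIC = 2k − 2·logLik, where k counts every estimated parameter (the non-aliased coefficients plus the error variance, under the usual stats::AIC convention). A quick arithmetic sketch for column (1), which reports 8 coefficients and hence k = 9:

```r
# AIC = 2k - 2*logLik; for column (1): 8 coefficients + error variance = 9 parameters
aic_from_loglik <- function(loglik, k) 2 * k - 2 * loglik
aic_from_loglik(6.624, 9)  # 4.752, matching the AIC row for model (1)
```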

Additional Steps

  • step both
# Stepwise Regression based on p values for full model#
k <- ols_step_both_p(model_wf_full_log)
## Stepwise Selection Method   
## ---------------------------
## 
## Candidate Terms: 
## 
## 1. X1 
## 2. X2 
## 3. X3 
## 4. X4 
## 5. X5 
## 6. X6 
## 7. X7 
## 8. X8 
## 9. X9 
## 
## We are selecting variables based on p value...
## 
## Variables Entered/Removed: 
## 
## - X4 added 
## - X3 added 
## - X7 added 
## 
## No more variables to be added/removed.
## 
## 
## Final Model Output 
## ------------------
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.944       RMSE               0.549 
## R-Squared               0.890       Coef. Var          8.618 
## Adj. R-Squared          0.878       MSE                0.301 
## Pred R-Squared          0.854       MAE                0.414 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                
## -------------------------------------------------------------------
##                Sum of                                              
##               Squares        DF    Mean Square      F         Sig. 
## -------------------------------------------------------------------
## Regression     63.565         3         21.188    70.378    0.0000 
## Residual        7.828        26          0.301                     
## Total          71.393        29                                    
## -------------------------------------------------------------------
## 
##                                  Parameter Estimates                                  
## -------------------------------------------------------------------------------------
##       model     Beta    Std. Error    Std. Beta      t       Sig      lower    upper 
## -------------------------------------------------------------------------------------
## (Intercept)    2.872         0.547                 5.254    0.000     1.748    3.995 
##          X4    0.122         0.033        0.559    3.730    0.001     0.055    0.189 
##          X3    0.168         0.040        0.435    4.165    0.000     0.085    0.251 
##          X7    3.106         1.537        0.309    2.021    0.054    -0.053    6.266 
## -------------------------------------------------------------------------------------
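The model summary can be re-derived from the ANOVA table above; as a sketch (numbers copied from the output, so F agrees with the reported 70.378 only up to rounding of the sums of squares):

```r
ssr <- 63.565; sse <- 7.828; sst <- ssr + sse   # sums of squares from the ANOVA table
f0  <- (ssr / 3) / (sse / 26)                   # F = MS(Regression) / MS(Residual)
r2  <- ssr / sst                                # R-squared = SSR / SST
round(c(F = f0, R2 = r2), 3)                    # ~70.38 and 0.890
```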
# k
# plot(k)

# Stepwise AIC Regression for full model#
k<- ols_step_both_aic(model_wf_full_log)
## Stepwise Selection Method 
## -------------------------
## 
## Candidate Terms: 
## 
## 1 . X1 
## 2 . X2 
## 3 . X3 
## 4 . X4 
## 5 . X5 
## 6 . X6 
## 7 . X7 
## 8 . X8 
## 9 . X9 
## 
## 
## Variables Entered/Removed: 
## 
## - X4 added 
## - X3 added 
## - X7 added 
## - X8 added 
## - X9 added 
## - X6 added 
## 
## No more variables to be added or removed.
# k
# plot(k)

# Stepwise Regression based on p values for all log model #
k <- ols_step_both_p(model_wf_all_log)
## Stepwise Selection Method   
## ---------------------------
## 
## Candidate Terms: 
## 
## 1. log(X1) 
## 2. log(X2) 
## 3. log(X3) 
## 4. log(X4) 
## 5. log(X5) 
## 6. log(X6) 
## 7. log(X7) 
## 8. log(X8) 
## 9. log(X9) 
## 
## We are selecting variables based on p value...
## 
## Variables Entered/Removed: 
## 
## - log(X4) added 
## 
## No more variables to be added/removed.
## 
## 
## Final Model Output 
## ------------------
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.954       RMSE               0.479 
## R-Squared               0.910       Coef. Var          7.526 
## Adj. R-Squared          0.907       MSE                0.230 
## Pred R-Squared          0.896       MAE                0.353 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                Sum of                                               
##               Squares        DF    Mean Square       F         Sig. 
## --------------------------------------------------------------------
## Regression     64.964         1         64.964    282.937    0.0000 
## Residual        6.429        28          0.230                      
## Total          71.393        29                                     
## --------------------------------------------------------------------
## 
##                                  Parameter Estimates                                  
## -------------------------------------------------------------------------------------
##       model     Beta    Std. Error    Std. Beta      t        Sig     lower    upper 
## -------------------------------------------------------------------------------------
## (Intercept)    4.189         0.156                 26.803    0.000    3.868    4.509 
##     log(X4)    1.259         0.075        0.954    16.821    0.000    1.106    1.413 
## -------------------------------------------------------------------------------------
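With a single predictor, the standardized beta equals the simple correlation between log(y) and log(X4), which is the square root of R²; the reported values check out:

```r
# One-predictor model: Std. Beta = correlation = sqrt(R-squared)
sqrt(0.910)  # ~0.954, matching both R and the Std. Beta for log(X4)
```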
# k
#plot(k)

# Stepwise AIC Regression for all log model #
k <- ols_step_both_aic(model_wf_all_log)
## Stepwise Selection Method 
## -------------------------
## 
## Candidate Terms: 
## 
## 1 . log(X1) 
## 2 . log(X2) 
## 3 . log(X3) 
## 4 . log(X4) 
## 5 . log(X5) 
## 6 . log(X6) 
## 7 . log(X7) 
## 8 . log(X8) 
## 9 . log(X9) 
## 
## 
## Variables Entered/Removed: 
## 
## - log(X4) added 
## - log(X8) added 
## - log(X9) added 
## - log(X6) added 
## - log(X1) added 
## - log(X3) added 
## 
## No more variables to be added or removed.
# k
# plot(k)

# Stepwise Regression based on p values for full log model #
# k <- ols_step_both_p(model_wf_full_log_inter)
# k
# plot(k)

# Stepwise AIC Regression for all full model #
# k <- ols_step_both_aic(model_wf_full_log_inter)
# k
# plot(k)

# Stepwise Regression based on p values for all log model #
# k <- ols_step_both_p(model_wf_all_log_inter)
# k
# plot(k)

# Stepwise AIC Regression for all log model #
# k <- ols_step_both_aic(model_wf_all_log_inter)
# k
# plot(k)

# Stepwise Regression based on p values for all log model #
# k <- ols_step_both_p(model_wf_mix2 )
# k
# plot(k)

# k <- ols_step_both_aic(model_wf_mix2)
# k
# plot(k)

# Stepwise Regression based on p values for X4 eliminated model#
# k <- ols_step_both_p(model_wf_rm4_log)
# k
# plot(k)

# Stepwise AIC Regression for X4 eliminated model#
# k<- ols_step_both_aic(model_wf_rm4_log)
# k
# plot(k)

# Stepwise Regression based on p values for X1 eliminated model#
# k <- ols_step_both_p(model_wf_rm1_log)
# k
# plot(k)

# Stepwise AIC Regression for X1 eliminated model#
# k<- ols_step_both_aic(model_wf_rm1_log)
# k
# plot(k)
# All Possible Regression for full log model #
# k <- ols_step_all_possible(model_wf_full_log)
# plot(k)
# head(arrange(k, desc(adjr)))

# All Possible Regression for all log model #
# k <- ols_step_all_possible(model_wf_all_log)
# plot(k)
# head(arrange(k, desc(adjr)))

# All Possible Regression for 3g log model #
#!!!!!!!!!!!! k <- ols_step_all_possible(model_wf_3g_log_inter)
# plot(k)
# head(arrange(k, desc(adjr)))
 
# All Possible Regression for mixed log model #
# k <- ols_step_all_possible(model_wf_mix2 )
# plot(k)
#  head(arrange(k, desc(adjr)))

# All Possible Regression for X4 eliminated model #
# k <- ols_step_all_possible(model_wf_rm4_log)
# k
# plot(k)

# All Possible Regression for X1 eliminated model #
# k <- ols_step_all_possible(model_wf_rm1_log)
# k
# plot(k)
# Lack-of-Fit F Test
# Requires replicate y observations at repeated x values so that pure error
# can be separated from lack of fit.

ols_pure_error_anova(lm(y ~ X1, data = table_wf))
ols_pure_error_anova(lm(y ~ X4, data = table_wf))
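ols_pure_error_anova() partitions SSE into pure error (within-group variation of y at repeated x values) and lack of fit. The partition can be sketched by hand on toy data (hypothetical values, not from table_wf):

```r
# Toy data with replicates at each x, so pure error is estimable
d   <- data.frame(x = c(1, 1, 2, 2, 3, 3), y = c(1.0, 1.2, 1.9, 2.1, 3.2, 2.8))
fit <- lm(y ~ x, data = d)
sse <- sum(resid(fit)^2)
# Pure error SS: within-group sum of squared deviations of y at each distinct x
sspe <- sum(ave(d$y, d$x, FUN = function(v) (v - mean(v))^2))
sslf <- sse - sspe                      # lack-of-fit SS
c(SSE = sse, PureError = sspe, LackOfFit = sslf)
```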

alias(lm(y ~ as.factor(X3) + as.factor(X4) + as.factor(X5) + as.factor(X6) + as.factor(X7), data=table_wf))

alias(lm(y ~ as.factor(X1) + as.factor(X8), data=table_wf))

alias(lm(y ~ as.factor(X4) + as.factor(X9), data=table_wf))

alias(lm(y ~ as.factor(X3) + as.factor(X6) + as.factor(X7) + as.factor(X8) + as.factor(X9), data=table_wf))
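alias() reports exact linear dependencies among model terms, which is what makes some of the factor-coded coefficients above inestimable. A minimal illustration on hypothetical data:

```r
# x2 is an exact linear function of x1 (and the intercept), so lm() cannot
# estimate both; alias() lists x2 under "Complete" aliasing.
set.seed(1)
d <- data.frame(x1 = 1:6, y = rnorm(6))
d$x2 <- 2 * d$x1 + 3
fit <- lm(y ~ x1 + x2, data = d)
alias(fit)       # shows x2 = 3*(Intercept) + 2*x1
coef(fit)        # x2's coefficient is NA
```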

Final models

  • Check model all log 13689
summary(model_wf_rm2457_all_log)
## 
## Call:
## lm(formula = log(y) ~ log(X1) + log(X3) + log(X6) + log(X8) + 
##     log(X9), data = table_wf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71392 -0.10557  0.06481  0.11154  0.25306 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.87592    0.22021  26.684  < 2e-16 ***
## log(X1)      0.68118    0.02343  29.067  < 2e-16 ***
## log(X3)      0.37527    0.09489   3.955 0.000591 ***
## log(X6)     -0.33576    0.08150  -4.120 0.000389 ***
## log(X8)      1.72499    0.16768  10.288 2.82e-10 ***
## log(X9)     -1.45653    0.14975  -9.727 8.42e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2148 on 24 degrees of freedom
## Multiple R-squared:  0.9845, Adjusted R-squared:  0.9813 
## F-statistic: 304.6 on 5 and 24 DF,  p-value: < 2.2e-16
Anova(model_wf_rm2457_all_log)
           Sum Sq Df F value   Pr(>F)
log(X1)    39      1  845     3.22e-20
log(X3)     0.722  1   15.6   0.000591
log(X6)     0.783  1   17     0.000389
log(X8)     4.88   1  106     2.82e-10
log(X9)     4.37   1   94.6   8.42e-10
Residuals   1.11  24
ols_regress(model_wf_rm2457_all_log)
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.992       RMSE               0.215 
## R-Squared               0.984       Coef. Var          3.374 
## Adj. R-Squared          0.981       MSE                0.046 
## Pred R-Squared          0.974       MAE                0.141 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                Sum of                                               
##               Squares        DF    Mean Square       F         Sig. 
## --------------------------------------------------------------------
## Regression     70.285         5         14.057    304.557    0.0000 
## Residual        1.108        24          0.046                      
## Total          71.393        29                                     
## --------------------------------------------------------------------
## 
##                                   Parameter Estimates                                    
## ----------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
## ----------------------------------------------------------------------------------------
## (Intercept)     5.876         0.220                 26.684    0.000     5.421     6.330 
##     log(X1)     0.681         0.023        0.908    29.067    0.000     0.633     0.730 
##     log(X3)     0.375         0.095        0.124     3.955    0.001     0.179     0.571 
##     log(X6)    -0.336         0.081       -0.109    -4.120    0.000    -0.504    -0.168 
##     log(X8)     1.725         0.168        0.543    10.288    0.000     1.379     2.071 
##     log(X9)    -1.457         0.150       -0.533    -9.727    0.000    -1.766    -1.147 
## ----------------------------------------------------------------------------------------
confint(model_wf_rm2457_all_log, level = 1-(0.05/6))
##                0.417 %   99.583 %
## (Intercept)  5.2428046  6.5090281
## log(X1)      0.6137985  0.7485522
## log(X3)      0.1024526  0.6480888
## log(X6)     -0.5700799 -0.1014438
## log(X8)      1.2429032  2.2070673
## log(X9)     -1.8870607 -1.0259933
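The level 1 - (0.05/6) used above is a Bonferroni correction: each of the six simultaneous intervals runs at α/6, so the family-wise coverage is at least 95%. As arithmetic:

```r
# Bonferroni-adjusted per-interval confidence level for m joint intervals
m <- 6; alpha <- 0.05
level <- 1 - alpha / m
level  # ~0.99167, i.e. each interval is a 99.167% interval
```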
# Collinearity Diagnostics #
 ols_vif_tol(model_wf_rm2457_all_log)
Variables Tolerance VIF
log(X1) 0.663 1.51
log(X3) 0.656 1.52
log(X6) 0.925 1.08
log(X8) 0.232 4.3 
log(X9) 0.216 4.64
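Tolerance and VIF are reciprocals (tolerance = 1 - R_j², VIF_j = 1/(1 - R_j²)), so each row above can be checked directly:

```r
# VIF is the reciprocal of tolerance
round(1 / 0.232, 2)  # ~4.31, agreeing with log(X8)'s reported VIF of 4.3
```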
#Model Fit Assessment
 ols_plot_diagnostics(model_wf_rm2457_all_log)

# Correlation Test for Normality
 ols_test_correlation(model_wf_rm2457_all_log) # Correlation between observed residuals and expected residuals under normality.
## [1] 0.9123016
# Residual Normality Tests
 ols_test_normality(model_wf_rm2457_all_log) # Tests for violation of the normality assumption: a large p-value gives no evidence of non-normality #
## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.846           5e-04 
## Kolmogorov-Smirnov        0.163          0.3630 
## Cramer-von Mises          6.9863         0.0000 
## Anderson-Darling          1.1772         0.0038 
## -----------------------------------------------
# Variable Contributions
 ols_plot_added_variable(model_wf_rm2457_all_log)

# Residual Plus Component Plot
 ols_plot_comp_plus_resid(model_wf_rm2457_all_log)

  • Check model mix1
summary(model_wf_mix1)
## 
## Call:
## lm(formula = log(y) ~ (X1 + X3 + X4)^2 + log(X8) + log(X9), data = table_wf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58067 -0.07614 -0.01057  0.08042  0.34122 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.59769    0.22800  11.394 1.07e-10 ***
## X1           1.28956    0.23173   5.565 1.36e-05 ***
## X3           0.29604    0.03971   7.456 1.86e-07 ***
## X4           0.39072    0.04174   9.361 3.96e-09 ***
## log(X8)      1.57550    0.17201   9.159 5.82e-09 ***
## log(X9)     -1.34486    0.15424  -8.719 1.37e-08 ***
## X1:X3       -0.38088    0.05579  -6.827 7.38e-07 ***
## X1:X4        0.03070    0.01243   2.471   0.0217 *  
## X3:X4             NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2266 on 22 degrees of freedom
## Multiple R-squared:  0.9842, Adjusted R-squared:  0.9791 
## F-statistic: 195.5 on 7 and 22 DF,  p-value: < 2.2e-16
Anova(model_wf_mix1)
           Sum Sq Df F value   Pr(>F)
X1          1.19   1  23.2    8.24e-05
X3          0.482  1   9.38   0.00569
X4          6.84   1 133      8.35e-11
log(X8)     4.31   1  83.9    5.82e-09
log(X9)     3.9    1  76      1.37e-08
X1:X3              0
X1:X4              0
X3:X4              0
Residuals   1.13  22
# Collinearity Diagnostics #
# (Zero tolerance / infinite VIF reflects the exact linear dependence among
# X1, X3, X4 and their interactions that also made X3:X4 inestimable above.)
ols_vif_tol(model_wf_mix1)
Variables Tolerance VIF
X1 0     Inf   
X3 0     Inf   
X4 0     Inf   
log(X8) 0.246 4.07
log(X9) 0.226 4.42
X1:X3 0     Inf   
X1:X4 0     Inf   
X3:X4 0     Inf   
#Model Fit Assessment
ols_plot_diagnostics(model_wf_mix1)

# Correlation Test for Normality
ols_test_correlation(model_wf_mix1) # Correlation between observed residuals and expected residuals under normality.
## [1] 0.9683687
# Residual Normality Tests
ols_test_normality(model_wf_mix1) # Tests for violation of the normality assumption: a large p-value gives no evidence of non-normality #
## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.9448         0.1225 
## Kolmogorov-Smirnov        0.1306         0.6387 
## Cramer-von Mises          6.871          0.0000 
## Anderson-Darling          0.5822         0.1182 
## -----------------------------------------------
# Variable Contributions
ols_plot_added_variable(model_wf_mix1)

# Residual Plus Component Plot
ols_plot_comp_plus_resid(model_wf_mix1)

  • Check model all log 134689
 summary(model_wf_rm257_all_log)
## 
## Call:
## lm(formula = log(y) ~ log(X1) + log(X3) + log(X4) + log(X6) + 
##     log(X8) + log(X9), data = table_wf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.70374 -0.09978  0.03781  0.11788  0.24638 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.29728    0.49196  10.768 1.86e-10 ***
## log(X1)      0.46313    0.16797   2.757 0.011217 *  
## log(X3)      0.24862    0.13447   1.849 0.077357 .  
## log(X4)      0.41702    0.31819   1.311 0.202940    
## log(X6)     -0.32609    0.08065  -4.043 0.000505 ***
## log(X8)      1.71460    0.16541  10.366 3.87e-10 ***
## log(X9)     -1.45039    0.14763  -9.824 1.07e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2117 on 23 degrees of freedom
## Multiple R-squared:  0.9856, Adjusted R-squared:  0.9818 
## F-statistic: 261.7 on 6 and 23 DF,  p-value: < 2.2e-16
 Anova(model_wf_rm257_all_log)
           Sum Sq Df F value   Pr(>F)
log(X1)     0.341  1   7.6    0.0112
log(X3)     0.153  1   3.42   0.0774
log(X4)     0.077  1   1.72   0.203
log(X6)     0.733  1  16.3    0.000505
log(X8)     4.82   1 107      3.87e-10
log(X9)     4.33   1  96.5    1.07e-09
Residuals   1.03  23
# Collinearity Diagnostics #
# ols_vif_tol(model_wf_rm257_all_log)

#Model Fit Assessment
# ols_plot_diagnostics(model_wf_rm257_all_log)

# Correlation Test for Normality
# ols_test_correlation(model_wf_rm257_all_log) # Correlation between observed residuals and expected residuals under normality.

# Residual Normality Test
# ols_test_normality(model_wf_rm257_all_log) # Test for violation of the normality assumption: a large p-value gives no evidence of non-normality #

# Variable Contributions
# ols_plot_added_variable(model_wf_rm257_all_log)

# Residual Plus Component Plot
# ols_plot_comp_plus_resid(model_wf_rm257_all_log)
  • Check model all log 135689
summary(model_wf_rm247_all_log)
## 
## Call:
## lm(formula = log(y) ~ log(X1) + log(X3) + log(X5) + log(X6) + 
##     log(X8) + log(X9), data = table_wf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67722 -0.08003  0.01102  0.13879  0.25715 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.57120    3.36000   0.170  0.86650    
## log(X1)      0.72550    0.03608  20.107 4.31e-16 ***
## log(X3)      0.41866    0.09605   4.359  0.00023 ***
## log(X5)      1.25873    0.79566   1.582  0.12731    
## log(X6)     -0.26702    0.09022  -2.960  0.00702 ** 
## log(X8)      1.62253    0.17508   9.267 3.15e-09 ***
## log(X9)     -1.37489    0.15416  -8.919 6.33e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2084 on 23 degrees of freedom
## Multiple R-squared:  0.986,  Adjusted R-squared:  0.9824 
## F-statistic: 270.1 on 6 and 23 DF,  p-value: < 2.2e-16
Anova(model_wf_rm247_all_log)
           Sum Sq Df F value   Pr(>F)
log(X1)    17.6    1 404      4.31e-16
log(X3)     0.825  1  19      0.00023
log(X5)     0.109  1   2.5    0.127
log(X6)     0.38   1   8.76   0.00702
log(X8)     3.73   1  85.9    3.15e-09
log(X9)     3.45   1  79.5    6.33e-09
Residuals   0.999 23
# Collinearity Diagnostics #
# ols_vif_tol(model_wf_rm247_all_log)

#Model Fit Assessment
# ols_plot_diagnostics(model_wf_rm247_all_log)

# Correlation Test for Normality
# ols_test_correlation(model_wf_rm247_all_log) # Correlation between observed residuals and expected residuals under normality.

# Residual Normality Test
# ols_test_normality(model_wf_rm247_all_log) # Test for violation of the normality assumption: a large p-value gives no evidence of non-normality #

# Variable Contributions
# ols_plot_added_variable(model_wf_rm247_all_log)

# Residual Plus Component Plot
# ols_plot_comp_plus_resid(model_wf_rm247_all_log)